Clustering and Alignment: Understanding the Training Dynamics in Modular Addition
— INTERACTIVE VISUALIZATIONS —
Visualizing the Training Dynamics in Modular Addition
Below we visualize how a simplified transformer with constant attention learns to add two numbers modulo N = 17. The model works as follows. First, each input number is embedded as a 2-dimensional vector (left). Then, constant attention is applied by summing the embedding vectors of the two input numbers (markers, right). Finally, two linear layers map this sum to a prediction of the output number modulo N (background, right).
The hidden layer has size H = 64 and uses a ReLU activation. Markers and background follow the same color scheme as the embeddings. Markers for pairs of numbers reserved for validation are not shown. Use the buttons below to control the training process.
[Interactive visualization. Readouts: Epoch, Train Loss, Valid Acc. Controls: Weight Decay, Learning Rate, Validation Data.]
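The forward pass of the model described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code behind the visualization: the parameter names (`E`, `W1`, `b1`, `W2`, `b2`), the bias terms, the initialization scale, and the random seed are all assumptions; only N = 17, the 2-dimensional embeddings, the summed embeddings, H = 64, and the ReLU activation come from the description.

```python
import numpy as np

N = 17   # modulus, as in the text
D = 2    # embedding dimension
H = 64   # hidden layer width

rng = np.random.default_rng(0)  # arbitrary seed (assumption)

# Randomly initialized parameters; names, biases, and scales are assumptions.
params = {
    "E": rng.normal(size=(N, D)),                 # one 2-d embedding per number
    "W1": rng.normal(size=(D, H)) / np.sqrt(D),   # first linear layer
    "b1": np.zeros(H),
    "W2": rng.normal(size=(H, N)) / np.sqrt(H),   # second linear layer
    "b2": np.zeros(N),
}

def forward(params, a, b):
    """Return logits over the N residues for integer input arrays a and b."""
    # Constant attention: sum the embeddings of the two input numbers.
    z = params["E"][a] + params["E"][b]
    # Hidden layer with ReLU activation.
    h = np.maximum(z @ params["W1"] + params["b1"], 0.0)
    # Output layer: one logit per possible result modulo N.
    return h @ params["W2"] + params["b2"]

# Enumerate all N * N input pairs and their targets (a + b) mod N.
a, b = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
a, b = a.ravel(), b.ravel()
targets = (a + b) % N

logits = forward(params, a, b)
print(logits.shape)  # (289, 17)
```

Because the two embeddings are summed, the logits for the pair (a, b) are identical to those for (b, a), so the model is symmetric in its inputs by construction.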
Particle Simulation
Below we visualize a particle simulation with N = 17 particles. The particles are initialized with random positions and their evolution is determined by the clustering and alignment forces between pairs of particles, as described in the paper. Use the buttons below to control the simulation.
[Interactive simulation. Readout: Step. Controls: Alignment Force, Step Size.]