Clustering and Alignment: Understanding the Training Dynamics in Modular Addition

— INTERACTIVE VISUALIZATIONS —

Visualizing the Training Dynamics in Modular Addition

Below we visualize how a simplified transformer with constant attention learns to add two numbers modulo N = 17. The model works as follows. First, each input number is embedded as a 2-dimensional vector (left). Next, constant attention is applied by summing the embedding vectors of the two input numbers (markers, right). Finally, two linear layers predict the output number modulo N (background, right).
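The forward pass described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the hidden width H = 64 and the ReLU activation follow the settings of the visualization, while the names `init_params`, `forward`, and `predict`, the Gaussian initialization scale, and the bias terms are assumptions made for this example and are not taken from the original model.

```python
import random

N = 17   # modulus, as in the visualization
D = 2    # embedding dimension
H = 64   # hidden width used by the visualization


def init_params(seed=0):
    """Randomly initialize parameters (the init scale is an assumption)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "E": mat(N, D),    # one 2-D embedding vector per number
        "W1": mat(H, D),   # first linear layer
        "b1": [0.0] * H,   # bias terms are an assumption
        "W2": mat(N, H),   # second linear layer, one logit per residue
        "b2": [0.0] * N,
    }


def forward(params, a, b):
    """Return the N output logits for the input pair (a, b)."""
    # Constant attention: add the two embedding vectors.
    x = [ea + eb for ea, eb in zip(params["E"][a], params["E"][b])]
    # First linear layer followed by ReLU.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)
         for row, bias in zip(params["W1"], params["b1"])]
    # Second linear layer produces the logits.
    return [sum(w * hi for w, hi in zip(row, h)) + bias
            for row, bias in zip(params["W2"], params["b2"])]


def predict(params, a, b):
    """Return the residue with the largest logit."""
    logits = forward(params, a, b)
    return max(range(N), key=logits.__getitem__)
```

Because the two embeddings are simply summed, the output for (a, b) is identical to the output for (b, a), which matches the symmetry of addition.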

The hidden layer has H = 64 units with ReLU activation. Markers and background follow the same color scheme as the embeddings. Markers for pairs of numbers reserved for validation are not shown. Use the buttons below to control the training process.

[Interactive controls and readouts: Epoch, Train Loss, Valid Acc, Weight Decay, Learning Rate, Validation Data]
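To make the notion of pairs reserved for validation concrete, here is a minimal sketch of how the N × N input pairs could be divided into training and validation sets. The splitting scheme used by the visualization is not stated here, so the uniformly random split, the `valid_fraction` parameter, and the function names `split_pairs` and `label` are all assumptions for illustration.

```python
import random

N = 17  # modulus, as in the visualization


def label(a, b):
    """Target output for the input pair (a, b)."""
    return (a + b) % N


def split_pairs(valid_fraction=0.2, seed=0):
    """Split all N*N input pairs into train and validation lists.

    A uniformly random split is an assumption; the page does not
    specify how validation pairs are chosen.
    """
    pairs = [(a, b) for a in range(N) for b in range(N)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_valid = int(round(valid_fraction * len(pairs)))
    return pairs[n_valid:], pairs[:n_valid]
```

For example, `train, valid = split_pairs(0.2)` holds out roughly one fifth of the 289 pairs; only the pairs in `train` would then appear as markers.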

Particle Simulation

Below we visualize a particle simulation with N = 17 particles. The particles are initialized at random positions, and their evolution is governed by the clustering and alignment forces between pairs of particles, as described in the paper. Use the buttons below to control the simulation.

[Interactive controls and readouts: Step, Alignment Force, Step Size]
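The general shape of such a simulation (random initial positions, pairwise forces, a fixed step size) can be sketched as follows. The clustering and alignment forces themselves are defined in the paper and are not reproduced here, so the sketch takes the pairwise force as a user-supplied function. The explicit Euler update, the 2-dimensional positions, the names `simulate` and `toy_attraction`, and the toy force are assumptions chosen only to make the example runnable.

```python
import random


def simulate(force, n_particles=17, steps=100, step_size=0.01, seed=0):
    """Evolve particles under a pairwise force with explicit Euler steps.

    `force(i, j, pi, pj)` returns the 2-D force on particle i due to
    particle j. The indices are passed in case the force depends on
    particle identity as well as position.
    """
    rng = random.Random(seed)
    # Random initial positions in the square [-1, 1] x [-1, 1].
    pos = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_particles)]
    for _ in range(steps):
        new_pos = []
        for i, (xi, yi) in enumerate(pos):
            fx = fy = 0.0
            for j, pj in enumerate(pos):
                if i != j:
                    dfx, dfy = force(i, j, (xi, yi), pj)
                    fx += dfx
                    fy += dfy
            new_pos.append((xi + step_size * fx, yi + step_size * fy))
        pos = new_pos  # update all particles simultaneously
    return pos


def toy_attraction(i, j, pi, pj):
    """Placeholder force pulling particle i toward particle j.

    This is NOT the clustering or alignment force from the paper.
    """
    return (pj[0] - pi[0], pj[1] - pi[1])
```

Calling `simulate(toy_attraction)` makes all particles collapse toward their common centroid; substituting the forces from the paper for `toy_attraction` would be needed to reproduce the behavior shown in the visualization.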