(Clustering in) Transformers: a mathematical perspective
Speaker: Borjan Geshkovski, Mathematics Department, MIT
Abstract: This talk will report on several results, insights and perspectives that Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet and I have obtained regarding Transformers. We model Transformers as interacting particle systems (each particle representing a token) with a non-linear coupling kernel called self-attention. For pure self-attention Transformers, we show that trained representations cluster in the long-time limit into different geometric configurations determined by spectral properties of the model weights. We also cover Transformers with layer-normalisation, which amounts to considering the interacting particle system on the sphere. On high-dimensional spheres, we prove that all randomly initialized particles converge to a single cluster. We sharpen this result by describing the precise phase transition between the clustering and non-clustering regimes. The appearance of metastability, and ideas for the low-dimensional regime, will be discussed.
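The clustering phenomenon described above can be illustrated numerically. The sketch below (an assumption-laden toy, not the speaker's exact model) evolves tokens as particles on the sphere under simplified self-attention dynamics with identity query, key and value matrices; re-normalising after each step crudely plays the role of layer-normalisation. With random initialization in high dimension, the pairwise inner products drift toward 1, i.e. all particles collapse to a single cluster, consistent with the high-dimensional result mentioned in the abstract.

```python
import numpy as np

# Toy self-attention particle dynamics on the sphere S^{d-1}.
# Assumptions (not from the talk): Q = K = V = identity, explicit Euler
# steps, and re-projection onto the sphere standing in for layer-norm.
def simulate(n=32, d=64, beta=1.0, steps=2000, dt=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # random init on S^{d-1}
    for _ in range(steps):
        logits = beta * x @ x.T                      # <Q x_i, K x_j> with Q = K = I
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)            # softmax attention weights
        v = w @ x                                    # attention output (V = I)
        v -= np.sum(v * x, axis=1, keepdims=True) * x  # project onto tangent space
        x += dt * v
        x /= np.linalg.norm(x, axis=1, keepdims=True)  # back onto the sphere
    return x

x = simulate()
gram = x @ x.T
# pairwise inner products near 1 indicate collapse to a single cluster
print(gram.min())
```

Lowering the dimension or tuning beta is a natural way to probe the clustering/non-clustering phase transition and the metastable plateaus the abstract alludes to.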
Please join via https://arizona.zoom.us/j/89115261848 (open to all, no password required).