Applied Math Guest Lecture

(Clustering in) Transformers: a mathematical perspective

When

2:30 – 3:45 p.m., Nov. 15, 2023

Speaker:  Borjan Geshkovski, Mathematics Department, MIT

Abstract:  This talk will report on several results, insights and perspectives that Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet and I have obtained on Transformers. We model Transformers as interacting particle systems (each particle representing a token), with a non-linear coupling kernel called self-attention. For pure self-attention Transformers, we show that trained representations cluster in the long-time limit into different geometric configurations determined by spectral properties of the model weights. We also cover Transformers with layer-normalisation, which amounts to considering the interacting particle system on the sphere. On high-dimensional spheres, we prove that all randomly initialized particles converge to a single cluster. We sharpen this result by describing the phase transition between the clustering and non-clustering regimes. The appearance of metastability, and ideas for the low-dimensional regime, will also be discussed.
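For the curious, the interacting-particle picture can be simulated in a few lines. The sketch below is an illustration only, not the speakers' code: tokens are points on the unit sphere, each step applies a self-attention coupling (softmax of pairwise inner products) followed by re-normalisation onto the sphere as a crude stand-in for layer-normalisation. The parameters n, d, beta, dt and the number of steps are arbitrary choices for the demo; in high dimension the particles visibly collapse toward a single cluster, as in the clustering result described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta, dt, steps = 32, 64, 1.0, 0.1, 300

# Random initialization on the unit sphere S^{d-1}
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

for _ in range(steps):
    logits = beta * (x @ x.T)                       # pairwise attention scores
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax (self-attention kernel)
    x = x + dt * (w @ x)                            # Euler step of the coupled dynamics
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # project back onto the sphere

# Minimum pairwise cosine similarity; values near 1 indicate a single cluster
min_cos = float((x @ x.T).min())
print(round(min_cos, 3))
```

Increasing beta sharpens the attention weights, and the time to cluster changes accordingly; the talk's phase-transition result makes this dependence precise.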

Please join via Zoom at https://arizona.zoom.us/j/89115261848 (open access, no password required).