Colloquium

Learning the language of proteins

When

3:30 – 4:30 p.m., Sept. 6, 2024

Speaker: Claire McWhite, Molecular & Cellular Biology, University of Arizona

Title: Learning the language of proteins

Abstract: We study the fundamental connection between protein amino acid sequences and protein function. A protein language model is a dynamic data structure that has learned the "syntax" of protein sequences. By conceptualizing protein sequences as a language, in which each amino acid represents a unique "word", we study the intricate "grammar" that governs how these sequences encode physical properties and functional information.

Our work introduces novel techniques for both annotating and modifying protein properties, based on studying how structural/functional information is stored, categorized, and transmitted within pretrained protein language models. We demonstrate how a model capable of "understanding" the logic of protein sequences can serve as an interpreter and aid for biology research, revealing useful information about a protein's function even in the absence of known annotations. We also use this "interpreter" to study the functional effects of sequence variation across evolution. We highlight instances where our predictions have revealed new insights into biological systems.