Learning the language of proteins
Speaker: Claire McWhite, Molecular & Cellular Biology, University of Arizona
Title: Learning the language of proteins
Abstract: We study the fundamental connection between protein amino acid sequences and protein function. A protein language model is a machine learning model that has learned the "syntax" of protein sequences. By conceptualizing protein sequences as a language, where each amino acid represents a unique "word", we study the intricate "grammar" that governs how these sequences encode physical properties and functional information.
Our work introduces novel techniques for both annotating and modifying protein properties, based on studying how structural and functional information is stored, categorized, and transmitted within pretrained protein language models. We demonstrate how a model capable of "understanding" the logic of protein sequences can serve as an interpreter and aid for biology research, revealing useful information about a protein's function even in the absence of known annotations. We also use this "interpreter" to study the functional effects of sequence variation across evolution. We highlight instances where our predictions have revealed new insights into biological systems.