Structure of the space of folding protein sequences defined by large language models

Authors
Zambon, A. [1 ,2 ]
Zecchina, R. [3 ]
Tiana, G. [1 ,2 ,4 ]
Affiliations
[1] Univ Milan, Dept Phys, Via Celoria 16, I-20133 Milan, Italy
[2] Univ Milan, Ctr Complex & Biosyst, Via Celoria 16, I-20133 Milan, Italy
[3] Bocconi Univ, Via Roentgen 1, I-20136 Milan, Italy
[4] Sez Milano, INFN, Via Celoria 16, I-20133 Milan, Italy
Keywords
energy landscape; protein evolution; canonical-ensemble sampling; machine learning; evolution
DOI
10.1088/1478-3975/ad205c
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy for sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy than those found using traditional methods such as Markov chain Monte Carlo. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in key sites and local entropy. Additionally, evaluations through molecular dynamics suggest that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.
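As an illustration only, the following minimal Python sketch shows what canonical-ensemble (Metropolis) sampling over sequence space can look like, together with a crude flatness probe around a sequence. It is not the authors' method: `toy_energy` is a hypothetical stand-in (mismatches to a target string) for the transformer-derived effective energy, `flatness` is a simple proxy and not the local-entropy algorithm the abstract refers to, and all names and parameters are assumptions introduced here.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_energy(seq, target):
    """Hypothetical effective energy: number of mismatches to a target.

    Assumption: stands in for a foldability proxy from a structure predictor.
    """
    return sum(a != b for a, b in zip(seq, target))


def metropolis(seq, energy, beta, steps, rng):
    """Sample sequences at inverse temperature beta with single-site moves."""
    seq = list(seq)
    e = energy(seq)
    for _ in range(steps):
        i = rng.randrange(len(seq))
        old = seq[i]
        seq[i] = rng.choice(ALPHABET)  # propose a point mutation
        e_new = energy(seq)
        # Metropolis rule: always accept downhill, uphill with Boltzmann weight
        if e_new <= e or rng.random() < math.exp(-beta * (e_new - e)):
            e = e_new
        else:
            seq[i] = old  # reject and restore the previous residue
    return "".join(seq), e


def flatness(seq, energy, threshold, n_probe, rng):
    """Fraction of random single mutants with energy <= threshold.

    A crude proxy for how wide the minimum around seq is (assumption).
    """
    hits = 0
    for _ in range(n_probe):
        mutant = list(seq)
        mutant[rng.randrange(len(seq))] = rng.choice(ALPHABET)
        hits += energy(mutant) <= threshold
    return hits / n_probe


if __name__ == "__main__":
    rng = random.Random(0)
    target = "ACDEFGHIKL"  # hypothetical target sequence
    energy = lambda s: toy_energy(s, target)
    start = "".join(rng.choice(ALPHABET) for _ in target)
    final, e = metropolis(start, energy, beta=2.0, steps=2000, rng=rng)
    print(final, e, flatness(final, energy, threshold=e + 1, n_probe=200, rng=rng))
```

In this toy setting, a sequence whose single mutants mostly stay below the energy threshold sits in a wider minimum than one whose mutants mostly exceed it, which is the qualitative distinction between wide and narrow minima drawn in the abstract.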
Pages: 12