Structure of the space of folding protein sequences defined by large language models

Authors
Zambon, A. [1, 2]
Zecchina, R. [3]
Tiana, G. [1, 2, 4]
Affiliations
[1] Univ Milan, Dept Phys, Via Celoria 16, I-20133 Milan, Italy
[2] Univ Milan, Ctr Complex & Biosyst, Via Celoria 16, I-20133 Milan, Italy
[3] Bocconi Univ, Via Roentgen 1, I-20136 Milan, Italy
[4] Sez Milano, INFN, Via Celoria 16, I-20133 Milan, Italy
Keywords
energy landscape; protein evolution; canonical-ensemble sampling; machine learning; EVOLUTION
DOI
10.1088/1478-3975/ad205c
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy for sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy than those found using traditional methods such as Markov chain Monte Carlo. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in key sites and local entropy. Additionally, evaluations through molecular dynamics suggest that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.
Pages: 12
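
The contrast the abstract draws between canonical-ensemble sampling and the search for wide, flat minima can be illustrated on a toy landscape. The sketch below is not the authors' method. The energy `toy_energy` is a hypothetical stand-in (fraction of mismatches to an arbitrary target) for the transformer-derived effective energy, and `flatness` is only a crude proxy for local entropy, assumed here to be the fraction of random few-site mutants whose energy stays within a tolerance of the reference. All function names and parameters are illustrative assumptions.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_energy(seq, target):
    """Hypothetical energy: fraction of positions that differ from a target.

    Stand-in only; the paper's energy comes from a structure-prediction model.
    """
    return sum(a != b for a, b in zip(seq, target)) / len(target)


def metropolis(seq, energy, temperature, steps, rng):
    """Canonical-ensemble (Metropolis) sampling with single-site mutations."""
    seq = list(seq)
    e = energy(seq)
    for _ in range(steps):
        i = rng.randrange(len(seq))
        old = seq[i]
        seq[i] = rng.choice(ALPHABET)
        e_new = energy(seq)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            e = e_new
        else:
            seq[i] = old  # reject: restore the previous residue
    return "".join(seq), e


def flatness(seq, energy, n_mut, tol, samples, rng):
    """Crude local-entropy proxy: fraction of random n_mut-site mutants whose
    energy is at most tol above that of the reference sequence."""
    e0 = energy(seq)
    hits = 0
    for _ in range(samples):
        mutant = list(seq)
        for i in rng.sample(range(len(seq)), n_mut):
            mutant[i] = rng.choice(ALPHABET)
        hits += energy(mutant) <= e0 + tol
    return hits / samples


# Illustrative run on a random 30-residue target (all values are arbitrary).
rng = random.Random(0)
target = "".join(rng.choice(ALPHABET) for _ in range(30))
start = "".join(rng.choice(ALPHABET) for _ in range(30))


def energy(s):
    return toy_energy(s, target)


best, e_best = metropolis(start, energy, temperature=0.02, steps=5000, rng=rng)
flat = flatness(best, energy, n_mut=2, tol=0.1, samples=500, rng=rng)
```

In this toy setting a minimum with a higher `flatness` value tolerates more mutations at nearly constant energy, which is the qualitative property the abstract attributes to natural sequences. A local-entropy-biased sampler would favour such minima during the search itself rather than measuring flatness afterwards.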