Structure of the space of folding protein sequences defined by large language models

Authors
Zambon, A. [1, 2]
Zecchina, R. [3]
Tiana, G. [1, 2, 4]
Affiliations
[1] Univ Milan, Dept Phys, Via Celoria 16, I-20133 Milan, Italy
[2] Univ Milan, Ctr Complex & Biosyst, Via Celoria 16, I-20133 Milan, Italy
[3] Bocconi Univ, Via Roentgen 1, I-20136 Milan, Italy
[4] Sez Milano, INFN, Via Celoria 16, I-20133 Milan, Italy
Keywords
energy landscape; protein evolution; canonical-ensemble sampling; machine learning; EVOLUTION
DOI
10.1088/1478-3975/ad205c
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy of sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy than those found using traditional methods such as Monte Carlo Markov Chains. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in key sites and local entropy. Additionally, evaluations through Molecular Dynamics suggest that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.
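
As a rough illustration of the canonical-ensemble sampling named in the keywords and abstract, the following minimal Python sketch runs a Metropolis walk over sequences with single-site mutations, and computes a crude flatness proxy for a minimum. Everything in it is an assumption made for illustration: `toy_energy` (a mismatch count against a made-up target) merely stands in for the transformer-based effective energy, `neighbor_fraction` is a simple neighbor count rather than the local-entropy measure used in the paper, and the sequence, `beta`, and step count are arbitrary. It is not the authors' algorithm.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_energy(seq, target):
    """Hypothetical stand-in energy: number of mismatches with a target."""
    return sum(a != b for a, b in zip(seq, target))


def metropolis(seq, energy, beta, n_steps, rng):
    """Sample sequences with weight exp(-beta * E) via single-site mutations."""
    seq = list(seq)
    e = energy(seq)
    for _ in range(n_steps):
        i = rng.randrange(len(seq))
        old = seq[i]
        seq[i] = rng.choice(ALPHABET)  # symmetric proposal
        e_new = energy(seq)
        # Metropolis rule: always accept downhill moves, uphill with
        # probability exp(-beta * dE); otherwise restore the old residue.
        if e_new <= e or rng.random() < math.exp(-beta * (e_new - e)):
            e = e_new
        else:
            seq[i] = old
    return "".join(seq), e


def neighbor_fraction(seq, energy, tol):
    """Crude flatness proxy (assumption, not the paper's local entropy):
    fraction of single-site mutants whose energy rises by at most tol."""
    e0 = energy(seq)
    ok = total = 0
    for i, old in enumerate(seq):
        for aa in ALPHABET:
            if aa == old:
                continue
            mutant = seq[:i] + aa + seq[i + 1:]
            total += 1
            ok += energy(mutant) - e0 <= tol
    return ok / total


rng = random.Random(0)
target = "MKTAYIAKQR"  # arbitrary illustrative sequence


def energy_fn(s):
    return toy_energy(s, target)


start = "".join(rng.choice(ALPHABET) for _ in target)
final_seq, final_energy = metropolis(start, energy_fn, beta=5.0,
                                     n_steps=5000, rng=rng)
```

In this toy landscape the single minimum is as narrow as it can be: no single-site mutant of `target` keeps the energy unchanged, so `neighbor_fraction(target, energy_fn, 0)` is zero. A wide, flat minimum in the sense of the abstract would instead tolerate many such mutations at little energy cost.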
Pages: 12