Structure of the space of folding protein sequences defined by large language models

Cited by: 0
Authors
Zambon, A. [1 ,2 ]
Zecchina, R. [3 ]
Tiana, G. [1 ,2 ,4 ]
Affiliations
[1] Univ Milan, Dept Phys, Via Celoria 16, I-20133 Milan, Italy
[2] Univ Milan, Ctr Complex & Biosyst, Via Celoria 16, I-20133 Milan, Italy
[3] Bocconi Univ, Via Roentgen 1, I-20136 Milan, Italy
[4] Sez Milano, INFN, Via Celoria 16, I-20133 Milan, Italy
Keywords
energy landscape; protein evolution; canonical-ensemble sampling; machine learning
DOI
10.1088/1478-3975/ad205c
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject Classification Codes
071010; 081704
Abstract
Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently developed structure-prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that serves as a proxy for sequence foldability. This landscape shares characteristics with optimization problems encountered in machine learning and constraint satisfaction. Our analysis reveals that natural proteins predominantly reside in wide, flat minima of this energy landscape. To investigate further, we employ statistical-mechanics algorithms specifically designed to explore regions of high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy than those found with traditional methods such as Markov chain Monte Carlo. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially at key sites and in local entropy. Additionally, evaluation through molecular dynamics suggests that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist with a majority of narrower minima.
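The sampling idea outlined in the abstract, namely biasing a walk in sequence space toward wide, flat minima of an effective foldability energy rather than toward the deepest narrow ones, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: foldability_energy is a toy hydrophobic-alternation surrogate (the paper derives its effective energy from a transformer-based structure predictor), and the local-entropy bias via single-mutant averaging is a crude stand-in for the dedicated high-local-entropy algorithms mentioned in the abstract, not the authors' method.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")


def foldability_energy(seq):
    """Toy surrogate energy (hypothetical): rewards alternation between
    hydrophobic and polar residues along the chain."""
    mismatches = sum(
        1.0 for a, b in zip(seq, seq[1:])
        if (a in HYDROPHOBIC) != (b in HYDROPHOBIC)
    )
    return -mismatches / len(seq)


def local_entropy_term(seq, n_neighbors=20):
    """Average energy of random single-point mutants: a rough proxy for the
    flatness (local entropy) of the landscape around `seq`."""
    total = 0.0
    for _ in range(n_neighbors):
        i = random.randrange(len(seq))
        mutant = seq[:i] + random.choice(AMINO_ACIDS) + seq[i + 1:]
        total += foldability_energy(mutant)
    return total / n_neighbors


def entropy_biased_metropolis(seq, beta=5.0, gamma=1.0, n_steps=2000):
    """Metropolis sampling in sequence space on E(s) + gamma * <E(neighbors)>,
    so the walk prefers wide, flat minima over equally deep narrow ones."""
    def objective(s):
        return foldability_energy(s) + gamma * local_entropy_term(s)

    current, e_curr = seq, objective(seq)
    for _ in range(n_steps):
        i = random.randrange(len(current))
        proposal = current[:i] + random.choice(AMINO_ACIDS) + current[i + 1:]
        e_prop = objective(proposal)
        if e_prop <= e_curr or random.random() < math.exp(-beta * (e_prop - e_curr)):
            current, e_curr = proposal, e_prop
    return current, e_curr


if __name__ == "__main__":
    random.seed(0)
    start = "".join(random.choice(AMINO_ACIDS) for _ in range(60))
    final_seq, final_energy = entropy_biased_metropolis(start)
    print(final_seq, round(final_energy, 3))
```

In a realistic setting the surrogate energy would be replaced by a foldability score from the language model, and the inverse temperature beta and flatness weight gamma would typically be tuned or annealed; the fixed seed and final print are only there to make the demonstration reproducible.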
Pages: 12
Related Papers
50 records in total
  • [41] Protein folding in contact map space
    Domany, E
    PHYSICA A, 2000, 288 (1-4): : 1 - 9
  • [42] Finding functional motifs in protein sequences with deep learning and natural language models
    Savojardo, Castrense
    Martelli, Pier Luigi
    Casadio, Rita
    CURRENT OPINION IN STRUCTURAL BIOLOGY, 2023, 81
  • [43] Conceptual structure coheres in human cognition but not in large language models
    Suresh, Siddharth
    Mukherjee, Kushin
    Yu, Xizheng
    Huang, Wei-Chun
    Padua, Lisa
    Rogers, Timothy T.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 722 - 738
  • [44] Response Generated by Large Language Models Depends on the Structure of the Prompt
    Sarangi, Pradosh Kumar
    Mondal, Himel
    INDIAN JOURNAL OF RADIOLOGY AND IMAGING, 2024, 34 (03): : 574 - 575
  • [45] Large Language Models are Not Models of Natural Language: They are Corpus Models
    Veres, Csaba
    IEEE ACCESS, 2022, 10 : 61970 - 61979
  • [46] Large Language Models
    Vargas, Diego Collarana
    Katsamanis, Nassos
    ERCIM NEWS, 2024, (136): : 12 - 13
  • [47] Large Language Models
    Cerf, Vinton G.
    COMMUNICATIONS OF THE ACM, 2023, 66 (08) : 7 - 7
  • [48] Protein structure: Folding and prions
    Rey-Gayo, A
    Torrecilla, FC
    ENFERMEDADES INFECCIOSAS Y MICROBIOLOGIA CLINICA, 2002, 20 (04): : 161 - 167
  • [49] Understanding protein domain-swapping using structure-based models of protein folding
    Mascarenhas, Nahren Manuel
    Gosavi, Shachi
    PROGRESS IN BIOPHYSICS & MOLECULAR BIOLOGY, 2017, 128 : 113 - 120
  • [50] Calibrating and constructing models of protein folding
    Ramsey, Jeffry L.
    Synthese, 2007, 155 : 307 - 320