Large language models generate functional protein sequences across diverse families

Cited by: 354
Authors
Madani, Ali [1 ,2 ]
Krause, Ben [1 ]
Greene, Eric R. [3 ]
Subramanian, Subu [4 ,5 ]
Mohr, Benjamin P. [6 ]
Holton, James M. [7 ,8 ,9 ]
Olmos, Jose Luis [3 ]
Xiong, Caiming [1 ]
Sun, Zachary Z. [6 ]
Socher, Richard [1 ]
Fraser, James S. [3 ]
Naik, Nikhil [1 ]
Affiliations
[1] Salesforce Res, Palo Alto, CA 94301 USA
[2] Profluent Bio, San Francisco, CA 94118 USA
[3] Univ Calif San Francisco, Dept Bioengn & Therapeut Sci, San Francisco, CA USA
[4] Univ Calif Berkeley, Dept Mol & Cell Biol, Berkeley, CA USA
[5] Univ Calif Berkeley, Howard Hughes Med Inst, Berkeley, CA USA
[6] Tierra Biosci, San Leandro, CA USA
[7] Lawrence Berkeley Natl Lab, Mol Biophys & Integrated Bioimaging Div, Berkeley, CA USA
[8] SLAC Natl Accelerator Lab, Stanford Synchrotron Radiat Lightsource, Menlo Pk, CA USA
[9] Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA USA
Funding
US National Institutes of Health;
Keywords
STRUCTURE REFINEMENT; T4 LYSOZYME; CONTACTS;
DOI
10.1038/s41587-022-01618-2
Chinese Library Classification (CLC)
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
A generative deep-learning model designs artificial proteins with desired enzymatic activities. Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
Pages: 1099+
Page count: 17
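As a rough illustration of the control-tag conditioning described in the abstract, the sketch below prepends property tags to an amino-acid token sequence and samples the sequence autoregressively. It is not ProGen's actual code or API; the model stub, tag names, and vocabulary are hypothetical placeholders.

```python
# Illustrative sketch only: control-tag-conditioned autoregressive generation
# in the spirit of the abstract, NOT the authors' ProGen implementation.
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
CONTROL_TAGS = ["<family:lysozyme>", "<keyword:hydrolase>"]  # hypothetical tags

def sample_next_token(context):
    """Stand-in for an autoregressive language model p(x_t | x_<t, tags).

    A trained model would return a learned distribution over amino acids
    conditioned on the context; here we draw uniformly so the sketch stays
    self-contained and runnable.
    """
    return random.choice(AMINO_ACIDS + ["<eos>"])

def generate(control_tags, max_len=300):
    # Conditioning: control tags are prepended to the token sequence, so every
    # next-token prediction is made given the desired protein properties.
    tokens = list(control_tags)
    sequence = []
    for _ in range(max_len):
        tok = sample_next_token(tokens + sequence)
        if tok == "<eos>":
            break
        sequence.append(tok)
    return "".join(sequence)

if __name__ == "__main__":
    print(generate(CONTROL_TAGS)[:60])
```

In the paper's framing, a trained causal language model would take the place of `sample_next_token`, and the prepended control tags steer generation toward the sequence distribution of the chosen family or property.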
Related papers
50 records in total
  • [41] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
    Biderman, Stella
    Schoelkopf, Hailey
    Anthony, Quentin
    Bradley, Herbie
    O'Brien, Kyle
    Hallahan, Eric
    Khan, Mohammad Aflah
    Purohit, Shivanshu
    Prashanth, U. S. V. S. N. Sai
    Raff, Edward
    Skowron, Aviya
    Sutawika, Lintang
    van der Wal, Oskar
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
  • [42] Enhancing functional gene set analysis with large language models
    Hu, Mengzhou
    Pratt, Dexter
    NATURE METHODS, 2025, 22 (01) : 22 - 23
  • [43] Using Large Language Models to Generate Script Concordance Test in Medical Education: ChatGPT and Claude
    Kiyak, Yavuz Selim
    Emekli, Emre
    SPANISH JOURNAL OF MEDICAL EDUCATION, 2025, 6 (01):
  • [44] Using Large Language Models to Generate and Apply Contingency Handling Procedures in Collaborative Assembly Applications
    Ka, Jeon Ho
    Dhanaraj, Neel
    Wadaskar, Siddhant
    Gupta, Satyandra K.
    2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2024), 2024, : 15585 - 15592
  • [45] Leveraging Large Language Models to Generate Course-Specific Semantically Annotated Learning Objects
    Lohr, Dominic
    Berges, Marc
    Chugh, Abhishek
    Kohlhase, Michael
    Mueller, Dennis
    JOURNAL OF COMPUTER ASSISTED LEARNING, 2025, 41 (01)
  • [46] Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins
    Hie, Brian L.
    Yang, Kevin K.
    Kim, Peter S.
    CELL SYSTEMS, 2022, 13 (04) : 274 - +
  • [47] NanoAbLLaMA: construction of nanobody libraries with protein large language models
    Wang, Xin
    Chen, Haotian
    Chen, Bo
    Liang, Lixin
    Mei, Fengcheng
    Huang, Bingding
    FRONTIERS IN CHEMISTRY, 2025, 13
  • [48] When Protein Structure Embedding Meets Large Language Models
    Ali, Sarwan
    Chourasia, Prakash
    Patterson, Murray
    GENES, 2024, 15 (01)
  • [49] Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies
    Pan, Liangming
    Saxon, Michael
    Xu, Wenda
    Nathani, Deepak
    Wang, Xinyi
    Wang, William Yang
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 484 - 506
  • [50] Tracing the Influence of Large Language Models across the Most Impactful Scientific Works
    Petrosanu, Dana-Mihaela
    Pirjan, Alexandru
    Tabusca, Alexandru
    ELECTRONICS, 2023, 12 (24)