Large language models generate functional protein sequences across diverse families

Cited by: 354
Authors
Madani, Ali [1, 2]
Krause, Ben [1]
Greene, Eric R. [3]
Subramanian, Subu [4, 5]
Mohr, Benjamin P. [6]
Holton, James M. [7, 8, 9]
Olmos, Jose Luis [3]
Xiong, Caiming [1]
Sun, Zachary Z. [6]
Socher, Richard [1]
Fraser, James S. [3]
Naik, Nikhil [1]
Affiliations
[1] Salesforce Res, Palo Alto, CA 94301 USA
[2] Profluent Bio, San Francisco, CA 94118 USA
[3] Univ Calif San Francisco, Dept Bioengn & Therapeut Sci, San Francisco, CA USA
[4] Univ Calif Berkeley, Dept Mol & Cell Biol, Berkeley, CA USA
[5] Univ Calif Berkeley, Howard Hughes Med Inst, Berkeley, CA USA
[6] Tierra Biosci, San Leandro, CA USA
[7] Lawrence Berkeley Natl Lab, Mol Biophys & Integrated Bioimaging Div, Berkeley, CA USA
[8] SLAC Natl Accelerator Lab, Stanford Synchrotron Radiat Lightsource, Menlo Pk, CA USA
[9] Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA USA
Funding
US National Institutes of Health (NIH)
Keywords
STRUCTURE REFINEMENT; T4; LYSOZYME; CONTACTS;
DOI
10.1038/s41587-022-01618-2
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification code
071005; 0836; 090102; 100705
Abstract
A generative deep-learning model designs artificial proteins with desired enzymatic activities. Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
Pages: 1099-+
Page count: 17
Source
Nature Biotechnology, 2023, 41: 1099-1106