Large language models generate functional protein sequences across diverse families

Cited: 354
Authors
Madani, Ali [1 ,2 ]
Krause, Ben [1]
Greene, Eric R. [3 ]
Subramanian, Subu [4 ,5 ]
Mohr, Benjamin P. [6 ]
Holton, James M. [7 ,8 ,9 ]
Olmos, Jose Luis [3 ]
Xiong, Caiming [1 ]
Sun, Zachary Z. [6]
Socher, Richard [1 ]
Fraser, James S. [3 ]
Naik, Nikhil [1 ]
Affiliations
[1] Salesforce Res, Palo Alto, CA 94301 USA
[2] Profluent Bio, San Francisco, CA 94118 USA
[3] Univ Calif San Francisco, Dept Bioengn & Therapeut Sci, San Francisco, CA USA
[4] Univ Calif Berkeley, Dept Mol & Cell Biol, Berkeley, CA USA
[5] Univ Calif Berkeley, Howard Hughes Med Inst, Berkeley, CA USA
[6] Tierra Biosci, San Leandro, CA USA
[7] Lawrence Berkeley Natl Lab, Mol Biophys & Integrated Bioimaging Div, Berkeley, CA USA
[8] SLAC Natl Accelerator Lab, Stanford Synchrotron Radiat Lightsource, Menlo Pk, CA USA
[9] Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA USA
Funding
US National Institutes of Health
Keywords
STRUCTURE REFINEMENT; T4 LYSOZYME; CONTACTS
DOI
10.1038/s41587-022-01618-2
Chinese Library Classification
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
A generative deep-learning model designs artificial proteins with desired enzymatic activities. Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
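The abstract describes conditional generation: control tags specifying protein properties are prepended to the amino-acid sequence, and the language model samples residues autoregressively under that conditioning. The sketch below is a minimal illustration of that idea, not the authors' code: the trained Transformer is replaced by a stub returning random logits, and the control-tag names are hypothetical placeholders.

```python
# Minimal sketch of control-tag-conditioned autoregressive protein generation,
# in the style described in the abstract. The "model" is a random-logit stub;
# in the real system it would be a large Transformer trained on ~280M sequences.
import math
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
CONTROL_TAGS = ["<family:lysozyme>", "<keyword:hydrolase>"]  # hypothetical tag names
VOCAB = CONTROL_TAGS + AMINO_ACIDS + ["<eos>"]

def stub_logits(prefix):
    """Stand-in for the trained language model; ignores the prefix and
    returns one random logit per vocabulary item."""
    return [random.gauss(0.0, 1.0) for _ in VOCAB]

def sample_next(prefix, temperature=1.0):
    """Softmax-sample the next token given the current prefix."""
    logits = stub_logits(prefix)
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    r, acc = random.random() * total, 0.0
    for tok, p in zip(VOCAB, probs):
        acc += p
        if r <= acc:
            return tok
    return VOCAB[-1]

def generate(control_tags, max_len=120):
    """Conditioning = prepend control tags describing the desired family/properties,
    then sample residues until <eos> or the length limit."""
    seq = list(control_tags)
    while len(seq) < max_len:
        tok = sample_next(seq)
        if tok == "<eos>":
            break
        if tok in AMINO_ACIDS:  # keep only residue tokens in the output protein
            seq.append(tok)
    return "".join(t for t in seq if t in AMINO_ACIDS)

print(generate(CONTROL_TAGS))
```

With a trained model in place of the stub, fine-tuning on a curated family (as done here for five lysozyme families) sharpens the distribution so that sampled sequences remain functional even at low identity to natural proteins.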
Pages: 1099+
Number of pages: 17