GENERALIST: A latent space based generative model for protein sequence families

Cited by: 1
Authors
Akl, Hoda [1]
Emison, Brooke [2]
Zhao, Xiaochuan [1]
Mondal, Arup [3]
Perez, Alberto [3]
Dixit, Purushottam D. [2, 4]
Affiliations
[1] Univ Florida, Dept Phys, Gainesville, FL 33612 USA
[2] Yale Univ, Dept Biomed Engn, New Haven, CT 06520 USA
[3] Univ Florida, Dept Chem, Gainesville, FL USA
[4] Yale Univ, Syst Biol Inst, West Haven, CT 06520 USA
Keywords
EXPANSION
DOI
10.1371/journal.pcbi.1011655
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting-related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high-order summary statistics of amino acid covariation. GENERALIST also predicts conservative local optimal sequences that are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles that of the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.

Protein sequence families show tremendous sequence variation. Yet, it is thought that a large portion of the functional sequence space remains unexplored. Generative models are machine learning methods that allow us to learn what makes proteins functional using sequences of naturally occurring proteins. Here, we present a new type of generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences, that is accurate, easy to implement, and works with very small datasets. We believe that GENERALIST will be an important tool in the repertoire of protein scientists and engineers alike.
Pages: 15