A method for multiple-sequence-alignment-free protein structure prediction using a protein language model

Cited by: 36
Authors
Fang, Xiaomin [1 ]
Wang, Fan [1 ]
Liu, Lihang [1 ]
He, Jingzhou [1 ]
Lin, Dayong [1 ]
Xiang, Yingfei [1 ]
Zhu, Kunrui [1 ]
Zhang, Xiaonan [1 ]
Wu, Hua [1 ]
Li, Hui [2 ]
Song, Le [2 ]
Affiliations
[1] Baidu Inc, NLP, Shenzhen, Peoples R China
[2] BioMap, Beijing, Peoples R China
DOI
10.1038/s42256-023-00721-6
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Protein structure prediction pipelines based on artificial intelligence, such as AlphaFold2, have achieved near-experimental accuracy. These advanced pipelines mainly rely on multiple sequence alignments (MSAs) as inputs to learn the co-evolution information from the homologous sequences. Nonetheless, searching MSAs from protein databases is time-consuming, usually taking tens of minutes. Consequently, we attempt to explore the limits of fast protein structure prediction by using only primary structures of proteins. Our proposed method, HelixFold-Single, combines a large-scale protein language model with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trains a large-scale protein language model on thousands of millions of primary structures using the self-supervised learning paradigm; this model is then used as an alternative to MSAs for learning the co-evolution information. Then, by combining the pre-trained protein language model and the essential components of AlphaFold2, we obtain an end-to-end differentiable model to predict the three-dimensional coordinates of atoms from only the primary structure. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving accuracy competitive with the MSA-based methods on targets with large homologous families. Furthermore, HelixFold-Single consumes much less time than the mainstream pipelines for protein structure prediction, demonstrating its potential in tasks requiring many predictions.

AlphaFold2 has revolutionized bioinformatics, but its ability to predict protein structures with high accuracy comes at the price of a costly database search for multiple sequence alignments. Fang and colleagues pre-train a large-scale protein language model and use it in conjunction with AlphaFold2 as a fully trainable and efficient model for structure prediction.
Pages: 1087-1096
Number of pages: 10
Related papers
50 items in total
  • [1] A method for multiple-sequence-alignment-free protein structure prediction using a protein language model
    Xiaomin Fang
    Fan Wang
    Lihang Liu
    Jingzhou He
    Dayong Lin
    Yingfei Xiang
    Kunrui Zhu
    Xiaonan Zhang
    Hua Wu
    Hui Li
    Le Song
    Nature Machine Intelligence, 2023, 5 : 1087 - 1096
  • [2] Integrating protein secondary structure prediction and multiple sequence alignment
    Simossis, VA
    Heringa, J
    CURRENT PROTEIN & PEPTIDE SCIENCE, 2004, 5 (04) : 249 - 266
  • [3] Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction
    Weissenow, Konstantin
    Heinzinger, Michael
    Rost, Burkhard
    STRUCTURE, 2022, 30 (08) : 1169 - +
  • [4] Protein multiple sequence alignment benchmarking through secondary structure prediction
    Le, Quan
    Sievers, Fabian
    Higgins, Desmond G.
    BIOINFORMATICS, 2017, 33 (09) : 1331 - 1337
  • [5] THE LIMITS OF PROTEIN SECONDARY STRUCTURE PREDICTION ACCURACY FROM MULTIPLE SEQUENCE ALIGNMENT
    RUSSELL, RB
    BARTON, GJ
    JOURNAL OF MOLECULAR BIOLOGY, 1993, 234 (04) : 951 - 957
  • [6] Application of multiple sequence alignment profiles to improve protein secondary structure prediction
    Cuff, JA
    Barton, GJ
    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2000, 40 (03) : 502 - 511
  • [7] Single-sequence protein structure prediction using a language model and deep learning
    Ratul Chowdhury
    Nazim Bouatta
    Surojit Biswas
    Christina Floristean
    Anant Kharkar
    Koushik Roy
    Charlotte Rochereau
    Gustaf Ahdritz
    Joanna Zhang
    George M. Church
    Peter K. Sorger
    Mohammed AlQuraishi
    Nature Biotechnology, 2022, 40 : 1617 - 1623
  • [8] Single-sequence protein structure prediction using a language model and deep learning
    Chowdhury, Ratul
    Bouatta, Nazim
    Biswas, Surojit
    Floristean, Christina
    Kharkare, Anant
    Roye, Koushik
    Rochereau, Charlotte
    Ahdritz, Gustaf
    Zhang, Joanna
    Church, George M.
    Sorger, Peter K.
    AlQuraishi, Mohammed
    NATURE BIOTECHNOLOGY, 2022, 40 (11) : 1617 - +
  • [9] Multiple protein sequence alignment
    Pei, Jimin
    CURRENT OPINION IN STRUCTURAL BIOLOGY, 2008, 18 (03) : 382 - 386
  • [10] A structure-based method for protein sequence alignment
    Kann, MG
    Thiessen, PA
    Panchenko, AR
    Schäffer, AA
    Altschul, SF
    Bryant, SH
    BIOINFORMATICS, 2005, 21 (08) : 1451 - 1456