Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks

Cited by: 1
Authors
Kabir, Anowarul [1 ]
Shehu, Amarda [1 ]
Affiliations
[1] George Mason Univ, Dept Comp Sci, Fairfax, VA 22030 USA
DOI: 10.1109/ICKG55886.2022.00021
Chinese Library Classification: TP18 (Theory of Artificial Intelligence)
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Building on the transformer architecture, which has revolutionized language models for natural language processing, protein language models (PLMs) are emerging as a powerful tool for learning over the large numbers of sequences in protein sequence databases and linking protein sequence to function. PLMs have been shown to learn useful, task-agnostic sequence representations that support predicting protein secondary structure, protein subcellular localization, and evolutionary relationships within protein families. However, existing models are trained strictly over protein sequences and miss an opportunity to leverage and integrate the information present in heterogeneous data sources. In this paper, inspired by the intrinsic role of three-dimensional (tertiary) protein structure in determining a broad range of protein properties, we propose a PLM that integrates and attends to both protein sequence and tertiary structure. In particular, this paper posits that learning joint sequence-structure representations yields better representations for function-related prediction tasks. A detailed experimental evaluation shows that such joint sequence-structure representations are more powerful than sequence-based representations, yield better performance on superfamily membership prediction across various metrics, and capture interesting relationships in the PLM-learned embedding space.
Pages: 105-112 (8 pages)
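
To make the abstract's central idea concrete, below is a minimal, hypothetical sketch (in PyTorch) of a joint sequence-structure encoder: per-residue amino-acid embeddings are fused with per-residue structure features and passed to a transformer encoder, whose pooled output feeds a superfamily-membership classifier. This is not the authors' architecture; the module names, dimensions, and the choice of backbone dihedral angles as structure features are illustrative assumptions, since the abstract only states that the model "integrates and attends to both protein sequence and tertiary structure."

    # Hypothetical sketch of a joint sequence-structure PLM encoder (not the paper's model).
    import torch
    import torch.nn as nn

    class JointSeqStructEncoder(nn.Module):
        def __init__(self, vocab_size=25, struct_dim=2, d_model=128,
                     nhead=4, num_layers=2, num_superfamilies=100):
            super().__init__()
            # Token embedding for the amino-acid sequence.
            self.seq_embed = nn.Embedding(vocab_size, d_model)
            # Project per-residue structure features (assumed here: phi/psi angles) to d_model.
            self.struct_embed = nn.Linear(struct_dim, d_model)
            # Transformer encoder attends over the fused per-residue representations.
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            # Downstream head for superfamily membership prediction.
            self.classifier = nn.Linear(d_model, num_superfamilies)

        def forward(self, seq_tokens, struct_feats):
            # seq_tokens:   (batch, length) integer-encoded residues
            # struct_feats: (batch, length, struct_dim) per-residue structure features
            x = self.seq_embed(seq_tokens) + self.struct_embed(struct_feats)
            h = self.encoder(x)                 # (batch, length, d_model)
            pooled = h.mean(dim=1)              # mean-pool to a per-protein embedding
            return self.classifier(pooled), pooled

    if __name__ == "__main__":
        model = JointSeqStructEncoder()
        seq = torch.randint(0, 25, (2, 60))     # two toy proteins, 60 residues each
        struct = torch.randn(2, 60, 2)          # toy phi/psi-angle features
        logits, embedding = model(seq, struct)
        print(logits.shape, embedding.shape)    # torch.Size([2, 100]) torch.Size([2, 128])

Other ways of injecting structure (e.g., cross-attention between a sequence stream and a structure stream, or features derived from residue-residue distance maps) fit the same joint-representation idea; the abstract does not specify which variant the authors use.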