Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks

Cited by: 1
Authors:
Kabir, Anowarul [1]
Shehu, Amarda [1]
Affiliation:
[1] George Mason Univ, Dept Comp Sci, Fairfax, VA 22030 USA
DOI:
10.1109/ICKG55886.2022.00021
Chinese Library Classification:
TP18 [Artificial Intelligence Theory]
Subject Classification Codes:
081104; 0812; 0835; 1405
Abstract:
Building on the transformer architecture and the revolution it has brought to language models for natural language processing, protein language models (PLMs) are emerging as a powerful tool for learning over the large numbers of sequences in protein sequence databases and for linking protein sequence to function. PLMs have been shown to learn useful, task-agnostic sequence representations that support predicting protein secondary structure, protein subcellular localization, and evolutionary relationships within protein families. However, existing models are trained only on protein sequences and so miss an opportunity to leverage and integrate the information present in heterogeneous data sources. In this paper, inspired by the intrinsic role of three-dimensional/tertiary protein structure in determining a broad range of protein properties, we propose a PLM that integrates and attends to both protein sequence and tertiary structure. In particular, this paper posits that learning joint sequence-structure representations yields better representations for function-related prediction tasks. A detailed experimental evaluation shows that such joint sequence-structure representations are more powerful than sequence-only representations, yield better performance on superfamily membership prediction across various metrics, and capture interesting relationships in the PLM-learned embedding space.
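
The abstract describes fusing sequence and tertiary-structure information within a transformer-based PLM. As a rough, hypothetical sketch of that kind of fusion (not the authors' actual architecture; the embedding dimensions, structure features, and label space below are assumptions for illustration), one could combine per-residue sequence embeddings with per-residue structure descriptors and let a transformer encoder attend over the joint representation before classifying superfamily membership:

```python
# Minimal sketch of a joint sequence-structure encoder (illustrative only,
# not the paper's implementation).
# Assumptions: per-residue sequence embeddings come from some pretrained PLM
# (stood in here by a learned amino-acid embedding); tertiary structure is
# summarized as per-residue features (e.g., backbone dihedrals or contact
# counts); the two streams are fused by projection into a shared space.
import torch
import torch.nn as nn

AA_VOCAB = 25          # 20 amino acids + special tokens (assumed)
STRUCT_DIM = 8         # per-residue structure descriptors (assumed)
EMBED_DIM = 128        # shared embedding dimension (assumed)
N_SUPERFAMILIES = 100  # placeholder label space

class JointSeqStructEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for a pretrained PLM's per-residue sequence embeddings.
        self.seq_embed = nn.Embedding(AA_VOCAB, EMBED_DIM)
        # Project structure descriptors into the same space.
        self.struct_proj = nn.Linear(STRUCT_DIM, EMBED_DIM)
        # Transformer layers attend jointly over the fused representation.
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(EMBED_DIM, N_SUPERFAMILIES)

    def forward(self, aa_tokens, struct_feats):
        # aa_tokens:    (batch, length) integer-coded residues
        # struct_feats: (batch, length, STRUCT_DIM) structure descriptors
        fused = self.seq_embed(aa_tokens) + self.struct_proj(struct_feats)
        h = self.encoder(fused)         # joint sequence-structure embeddings
        pooled = h.mean(dim=1)          # average over residues
        return self.classifier(pooled)  # superfamily logits

# Toy usage on random data.
model = JointSeqStructEncoder()
aa = torch.randint(0, AA_VOCAB, (2, 50))
feats = torch.randn(2, 50, STRUCT_DIM)
logits = model(aa, feats)               # shape: (2, N_SUPERFAMILIES)
```

Here fusion is done by additive projection into a shared space; cross-attention between the sequence and structure streams is an equally plausible design and may be closer to the attention-based integration the abstract refers to.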
Pages: 105-112
Page count: 8
Related Papers (50 in total)
  • [21] Capturing protein sequence-structure specificity using computational sequence design
    Mach, Paul
    Koehl, Patrice
    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2013, 81 (09) : 1556 - 1570
  • [22] Exploring the sequence-structure protein landscape in the glycosyltransferase family
    Zhang, ZD
    Kochhar, S
    Grigorov, M
    PROTEIN SCIENCE, 2003, 12 (10) : 2291 - 2302
  • [23] ViTO: tool for refinement of protein sequence-structure alignments
    Catherinot, V
    Labesse, G
    BIOINFORMATICS, 2004, 20 (18) : 3694 - 3696
  • [24] Empirical potential function for simplified protein models: Combining contact and local sequence-structure descriptors
    Zhang, Jinfeng
    Chen, Rong
    Liang, Jie
    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2006, 63 (04) : 949 - 960
  • [25] The representation of structure in sequence prediction tasks
    Cleeremans, A.
    ATTENTION AND PERFORMANCE XV: CONSCIOUS AND NONCONSCIOUS INFORMATION PROCESSING, 1994, 15 : 783 - 809
  • [26] Reverse Transfer Learning: Can Word Embeddings Trained for Different NLP Tasks Improve Neural Language Models?
    Verwimp, Lyan
    Bellegarda, Jerome R.
    INTERSPEECH 2019, 2019, : 3485 - 3489
  • [27] E-pRSA: Embeddings Improve the Prediction of Residue Relative Solvent Accessibility in Protein Sequence
    Manfredi, Matteo
    Savojardo, Castrense
    Martelli, Pier Luigi
    Casadio, Rita
    JOURNAL OF MOLECULAR BIOLOGY, 2024, 436 (17)
  • [28] Superior protein thermophilicity prediction with protein language model embeddings
    Haselbeck, Florian
    John, Maura
    Zhang, Yuqi
    Pirnay, Jonathan
    Fuenzalida-Werner, Juan Pablo
    Costa, Ruben D.
    Grimm, Dominik G.
    NAR GENOMICS AND BIOINFORMATICS, 2023, 5 (04)
  • [29] Fast protein fold recognition and accurate sequence-structure alignment
    Zimmer, R
    Thiele, R
    BIOINFORMATICS, 1997, 1278 : 137 - 146
  • [30] Augmenting Large Language Models via Vector Embeddings to Improve Domain-specific Responsiveness
    Wolfrath, Nathan M.
    Verhagen, Nathaniel B.
    Crotty, Bradley H.
    Somai, Melek
    Kothari, Anai N.
    JOVE-JOURNAL OF VISUALIZED EXPERIMENTS, 2024, (214):