Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks

Cited by: 1
Authors
Kabir, Anowarul [1 ]
Shehu, Amarda [1 ]
Affiliations
[1] George Mason Univ, Dept Comp Sci, Fairfax, VA 22030 USA
DOI: 10.1109/ICKG55886.2022.00021
Chinese Library Classification: TP18 (Theory of Artificial Intelligence)
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Building on the transformer architecture, which has revolutionized language models for natural language processing, protein language models (PLMs) are emerging as a powerful tool for learning over the large numbers of sequences in protein sequence databases and linking protein sequence to function. PLMs have been shown to learn useful, task-agnostic sequence representations that support predicting protein secondary structure, protein subcellular localization, and evolutionary relationships within protein families. However, existing models are trained strictly over protein sequences and miss an opportunity to leverage and integrate the information present in heterogeneous data sources. In this paper, inspired by the intrinsic role of three-dimensional (tertiary) protein structure in determining a broad range of protein properties, we propose a PLM that integrates and attends to both protein sequence and tertiary structure. In particular, this paper posits that learning joint sequence-structure representations yields better representations for function-related prediction tasks. A detailed experimental evaluation shows that such joint sequence-structure representations are more powerful than sequence-based representations, yield better performance on superfamily membership prediction across various metrics, and capture interesting relationships in the PLM-learned embedding space.
Pages: 105-112 (8 pages)
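
To make the abstract's central idea concrete, below is a minimal, hypothetical sketch (in PyTorch) of a joint sequence-structure encoder: per-residue amino-acid embeddings are fused with per-residue structure features and passed to a transformer encoder, whose pooled output feeds a superfamily-membership classifier. This is not the authors' architecture; the module names, dimensions, and the choice of backbone dihedral angles as structure features are illustrative assumptions, since the abstract only states that the model "integrates and attends to both protein sequence and tertiary structure."

    # Hypothetical sketch of a joint sequence-structure PLM encoder (not the paper's model).
    import torch
    import torch.nn as nn

    class JointSeqStructEncoder(nn.Module):
        def __init__(self, vocab_size=25, struct_dim=2, d_model=128,
                     nhead=4, num_layers=2, num_superfamilies=100):
            super().__init__()
            # Token embedding for the amino-acid sequence.
            self.seq_embed = nn.Embedding(vocab_size, d_model)
            # Project per-residue structure features (assumed here: phi/psi angles) to d_model.
            self.struct_embed = nn.Linear(struct_dim, d_model)
            # Transformer encoder attends over the fused per-residue representations.
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            # Downstream head for superfamily membership prediction.
            self.classifier = nn.Linear(d_model, num_superfamilies)

        def forward(self, seq_tokens, struct_feats):
            # seq_tokens:   (batch, length) integer-encoded residues
            # struct_feats: (batch, length, struct_dim) per-residue structure features
            x = self.seq_embed(seq_tokens) + self.struct_embed(struct_feats)
            h = self.encoder(x)                 # (batch, length, d_model)
            pooled = h.mean(dim=1)              # mean-pool to a per-protein embedding
            return self.classifier(pooled), pooled

    if __name__ == "__main__":
        model = JointSeqStructEncoder()
        seq = torch.randint(0, 25, (2, 60))     # two toy proteins, 60 residues each
        struct = torch.randn(2, 60, 2)          # toy phi/psi-angle features
        logits, embedding = model(seq, struct)
        print(logits.shape, embedding.shape)    # torch.Size([2, 100]) torch.Size([2, 128])

Other ways of injecting structure (e.g., cross-attention between a sequence stream and a structure stream, or features derived from residue-residue distance maps) fit the same joint-representation idea; the abstract does not specify which variant the authors use.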