ESMSec: Prediction of Secreted Proteins in Human Body Fluids Using Protein Language Models and Attention

Authors:

Wang, Yan [1]

Sun, Huiting [1]

Sheng, Nan [1]

He, Kai [2]

Hou, Wenjv [1]

Zhao, Ziqi [1]

Yang, Qixing [1]

Huang, Lan [1]

Affiliations:

[1] Jilin Univ, Minist Educ, Coll Comp Sci & Technol, Key Lab Symbol Computat & Knowledge Engn, Changchun 130012, Peoples R China

[2] Univ Michigan, Dept Computat Med & Bioinformat, Ann Arbor, MI 48103 USA

Source:

INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES | 2024 / Vol. 25 / Issue 12

Funding:

National Natural Science Foundation of China;

Keywords:

disease biomarkers; protein language models; multi-head attention; human body fluid; BIOMARKER DISCOVERY;

DOI:

10.3390/ijms25126371

Chinese Library Classification (CLC):

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Proteins secreted into human body fluids are potential disease biomarkers. Such biomarkers can support early diagnosis and disease risk prediction, so the study of secreted proteins in human body fluids has great practical value. In recent years, deep-learning-based transformer language models have been carried over from natural language processing (NLP) to proteomics, leading to the development of protein language models (PLMs) for protein sequence representation. Here, we propose a deep learning framework called ESM Predict Secreted Proteins (ESMSec) to predict proteins secreted into three types of human body fluid. ESMSec is based on the ESM2 model and an attention architecture. Specifically, protein sequences are first fed into the ESM2 model to extract features from the last hidden layer, and every input protein is encoded as a fixed 1000 × 480 matrix. Second, multi-head attention with a fully connected neural network serves as the classifier, performing binary classification of whether each protein is secreted into a given body fluid. Our experiments used three important and ubiquitous human body fluids. Experimental results show that ESMSec achieved average accuracies of 0.8486, 0.8358, and 0.8325 on the testing datasets for plasma, cerebrospinal fluid (CSF), and seminal fluid, respectively, on average outperforming the state-of-the-art (SOTA) methods. This performance demonstrates that ESM can improve prediction performance and has great potential for screening human body fluid proteins for secretion.
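
The abstract states that every protein is encoded as a fixed 1000 × 480 matrix but does not say how sequences of different lengths are brought to that shape. The sketch below shows one common way to do it, truncation plus zero-padding, purely as an assumption; the function names `to_fixed_matrix` and `padding_mask` are hypothetical and are not taken from the paper.

```python
from typing import List

MAX_LEN = 1000   # number of residue positions, as given in the abstract
EMBED_DIM = 480  # per-residue feature width, as given in the abstract

Matrix = List[List[float]]


def to_fixed_matrix(residue_embeddings: Matrix,
                    max_len: int = MAX_LEN,
                    embed_dim: int = EMBED_DIM,
                    pad_value: float = 0.0) -> Matrix:
    """Return a max_len x embed_dim matrix.

    Assumed scheme (not confirmed by the source): keep the first
    max_len residues and fill any remaining rows with pad_value.
    """
    for row in residue_embeddings:
        if len(row) != embed_dim:
            raise ValueError(f"expected rows of width {embed_dim}, got {len(row)}")
    kept = [list(row) for row in residue_embeddings[:max_len]]
    padding = [[pad_value] * embed_dim for _ in range(max_len - len(kept))]
    return kept + padding


def padding_mask(seq_len: int, max_len: int = MAX_LEN) -> List[bool]:
    """True for positions holding real residues, False for padded rows."""
    real = min(seq_len, max_len)
    return [True] * real + [False] * (max_len - real)
```

A mask of this kind is typically passed to the attention layer so that padded rows do not contribute to the attention weights; whether ESMSec does so is not stated in the abstract.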

Pages: 13
