PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset

Cited by: 1

Authors:

Zhang, Xuechun [1,2,3]

Hu, Xiaoxuan [1,2,3]

Zhang, Tongtong [1,2,3]

Yang, Ling [1,2,3]

Liu, Chunhong [1,2,3]

Xu, Ning [1,2,3]

Wang, Haoyi [1,2,3,4]

Sun, Wen [1,2,4]

Affiliations:

[1] Chinese Acad Sci, Inst Zool, Key Lab Organ Regenerat & Reconstruct, State Key Lab Stem Cell & Reprod Biol, 1 Beichen West Rd, Beijing 100101, Peoples R China

[2] Chinese Acad Sci, Inst Stem Cell & Regenerat, 1 Beichen West Rd, Beijing 100101, Peoples R China

[3] Univ Chinese Acad Sci, 1 Yanqihu East Rd, Beijing 101408, Peoples R China

[4] Beijing Inst Stem Cell & Regenerat Med, A 3 Datun Rd, Beijing 100100, Peoples R China

Source:

BRIEFINGS IN BIOINFORMATICS | 2024 / Vol. 25 / Issue 05

Funding:

National Natural Science Foundation of China;

Keywords:

protein solubility prediction; protein language models; enzymes of interest; SEQUENCE-BASED PREDICTION; DISCOVERY;

DOI:

10.1093/bib/bbae404

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. With the reduction in sequencing and gene synthesis costs, high-throughput experimental screening coupled with tailored bioinformatic prediction is increasingly being adopted for the development of novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process, and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural databases such as the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a 6.4% increase in accuracy, a 9.0% increase in F1_score, and an 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation using our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showed good performance by PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.

Pages: 10
