PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset

Cited by: 1
Authors
Zhang, Xuechun [1, 2, 3]
Hu, Xiaoxuan [1, 2, 3]
Zhang, Tongtong [1, 2, 3]
Yang, Ling [1, 2, 3]
Liu, Chunhong [1, 2, 3]
Xu, Ning [1, 2, 3]
Wang, Haoyi [1, 2, 3, 4]
Sun, Wen [1, 2, 4]
Affiliations
[1] Chinese Acad Sci, Inst Zool, Key Lab Organ Regenerat & Reconstruct, State Key Lab Stem Cell & Reprod Biol, 1 Beichen West Rd, Beijing 100101, Peoples R China
[2] Chinese Acad Sci, Inst Stem Cell & Regenerat, 1 Beichen West Rd, Beijing 100101, Peoples R China
[3] Univ Chinese Acad Sci, 1 Yanqihu East Rd, Beijing 101408, Peoples R China
[4] Beijing Inst Stem Cell & Regenerat Med, A 3 Datun Rd, Beijing 100100, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
protein solubility prediction; protein language models; enzymes of interest; SEQUENCE-BASED PREDICTION; DISCOVERY;
DOI
10.1093/bib/bbae404
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. As sequencing and gene synthesis costs fall, high-throughput experimental screening coupled with tailored bioinformatic prediction is increasingly adopted to develop novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process, yet accurate prediction of solubility remains challenging. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract more of the intrinsic information in protein sequences. Leveraging these models along with the increasing availability of protein solubility data inferred from structural databases such as the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a 6.4% increase in accuracy, a 9.0% increase in F1_score, and an 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation using our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showed good performance for PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.
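The three metrics named in the abstract (accuracy, F1 score, and Matthews correlation coefficient) can be illustrated with a minimal sketch for a binary soluble/insoluble classifier. This is not the PLM_Sol evaluation code; the function name `binary_metrics` and the label encoding (1 = soluble, 0 = insoluble) are assumptions made purely for illustration.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, F1, and MCC for binary labels.

    Assumed encoding (hypothetical, for illustration only):
    1 = soluble, 0 = insoluble.
    """
    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # MCC uses all four cells; it is set to 0 here when undefined.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Toy labels, not data from the study.
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Unlike accuracy, MCC accounts for all four confusion-matrix cells, which makes it more informative when the soluble and insoluble classes are imbalanced.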
Pages: 10