PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset

Cited by: 1
Authors
Zhang, Xuechun [1 ,2 ,3 ]
Hu, Xiaoxuan [1 ,2 ,3 ]
Zhang, Tongtong [1 ,2 ,3 ]
Yang, Ling [1 ,2 ,3 ]
Liu, Chunhong [1 ,2 ,3 ]
Xu, Ning [1 ,2 ,3 ]
Wang, Haoyi [1 ,2 ,3 ,4 ]
Sun, Wen [1 ,2 ,4 ]
Affiliations
[1] Chinese Acad Sci, Inst Zool, Key Lab Organ Regenerat & Reconstruct, State Key Lab Stem Cell & Reprod Biol, 1 Beichen West Rd, Beijing 100101, Peoples R China
[2] Chinese Acad Sci, Inst Stem Cell & Regenerat, 1 Beichen West Rd, Beijing 100101, Peoples R China
[3] Univ Chinese Acad Sci, 1 Yanqihu East Rd, Beijing 101408, Peoples R China
[4] Beijing Inst Stem Cell & Regenerat Med, A 3 Datun Rd, Beijing 100100, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
protein solubility prediction; protein language models; enzymes of interest; SEQUENCE-BASED PREDICTION; DISCOVERY;
DOI
10.1093/bib/bbae404
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. As sequencing and gene synthesis costs fall, high-throughput experimental screening coupled with tailored bioinformatic prediction is increasingly used to develop novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process, and accurate prediction of solubility remains a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract more of the intrinsic information in protein sequences. Leveraging these models, along with the increasing availability of protein solubility data inferred from structural databases such as the Protein Data Bank, holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previously reported models, achieving a 6.4% increase in accuracy, a 9.0% increase in F1_score, and an 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation using our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showed good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both the independent test set and the experimental set, making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.
Pages: 10