Benchmarking protein language models for protein crystallization

Cited by: 0
|
Authors
Mall, Raghvendra [1 ]
Kaushik, Rahul [1 ]
Martinez, Zachary A. [2 ]
Thomson, Matt W. [2 ]
Castiglione, Filippo [1 ,3 ]
Affiliations
[1] Technol Innovat Inst, Biotechnol Res Ctr, POB 9639, Abu Dhabi, U Arab Emirates
[2] CALTECH, Div Biol & Bioengn, Pasadena, CA 91125 USA
[3] Natl Res Council Italy, Inst Appl Comp, I-00185 Rome, Italy
Source
SCIENTIFIC REPORTS | 2025, Vol. 15, Iss. 1
Keywords
Open protein language models (PLMs); Protein crystallization; Benchmarking; Protein generation; PROPENSITY PREDICTION; REFINEMENT;
DOI
10.1038/s41598-025-86519-5
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biosciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to predict the crystallization propensity of proteins from their sequences, overcoming the high attrition rate, experimental cost, and extensive trial-and-error of crystallization experiments. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework democratizing the usage of PLMs, for the task of predicting the crystallization propensity of proteins. By comparing LightGBM/XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs (ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, and SaProt) with the performance of state-of-the-art sequence-based methods such as DeepCrystal, ATTCrys, and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers utilizing embeddings from the ESM2 models with 30 and 36 transformer layers (150 million and 3 billion parameters, respectively) outperform all compared models by 3-5% across various evaluation metrics, including AUPR (area under the precision-recall curve), AUC (area under the receiver operating characteristic curve), and F1 on independent test sets. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting from 3000 generated proteins and applying a series of filtration steps, including consensus across all open PLM-based classifiers, sequence-identity filtering with CD-HIT, secondary-structure compatibility, aggregation screening, homology search, and foldability evaluation, we identified a set of 5 novel proteins as potentially crystallizable.
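For illustration, the classification pipeline described in the abstract can be sketched as follows: mean-pool the per-residue embeddings from an ESM2 checkpoint into one fixed-length vector per protein, then fit a LightGBM classifier on those vectors. This is a minimal sketch, not the authors' TRILL-based code: it assumes the open-source fair-esm and lightgbm packages, and the sequences, labels, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch (not the authors' TRILL pipeline): mean-pooled ESM2
# embeddings feeding a LightGBM crystallization classifier.
# Assumes the `fair-esm` and `lightgbm` packages; data below are toy placeholders.
import numpy as np
import torch
import esm
from lightgbm import LGBMClassifier

# Load the 30-layer, 150M-parameter ESM2 checkpoint (one of the two
# best-performing embedding models reported in the abstract).
model, alphabet = esm.pretrained.esm2_t30_150M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def mean_embed(seqs):
    """Average the final-layer residue embeddings into one vector per protein."""
    data = [(f"seq{i}", s) for i, s in enumerate(seqs)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[30])
    reps = out["representations"][30]
    # Skip the BOS token (index 0) and padding/EOS when pooling over residues.
    return np.stack([reps[i, 1 : len(s) + 1].mean(0).numpy()
                     for i, s in enumerate(seqs)])

# Toy training data: label 1 = crystallizable, 0 = not crystallizable.
train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
              "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ"]
train_y = [1, 0]

# min_child_samples=1 only so this tiny toy set actually grows trees.
clf = LGBMClassifier(n_estimators=500, min_child_samples=1)
clf.fit(mean_embed(train_seqs), train_y)
print(clf.predict_proba(mean_embed(["MSLLTEVETYVLSIIPSGPLKAEIAQRLEDV"]))[:, 1])
```

In real use the classifier would be trained on the full labelled crystallization dataset, and the same `mean_embed` routine could be swapped to any of the other benchmarked PLMs.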
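One step of the post-generation filtration pipeline, redundancy removal with CD-HIT, can likewise be sketched as a subprocess call. The identity cutoff, word size, and file names below are assumptions for illustration; the abstract does not state the thresholds actually used.

```python
# Sketch of one filtration step: clustering ProtGPT2-generated sequences
# with CD-HIT to remove near-duplicates. Cutoff and file names are
# illustrative assumptions, not the paper's reported settings.
import subprocess

subprocess.run(
    [
        "cd-hit",
        "-i", "generated_proteins.fasta",  # e.g. the 3000 generated sequences
        "-o", "nonredundant.fasta",        # cluster representatives kept
        "-c", "0.9",                       # sequence-identity threshold
        "-n", "5",                         # word size suitable for c >= 0.7
    ],
    check=True,
)
```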
Pages: 17