Benchmarking protein language models for protein crystallization

Cited by: 0
Authors
Mall, Raghvendra [1 ]
Kaushik, Rahul [1 ]
Martinez, Zachary A. [2 ]
Thomson, Matt W. [2 ]
Castiglione, Filippo [1 ,3 ]
Affiliations
[1] Technol Innovat Inst, Biotechnol Res Ctr, POB 9639, Abu Dhabi, U Arab Emirates
[2] CALTECH, Div Biol & Bioengn, Pasadena, CA 91125 USA
[3] Natl Res Council Italy, Inst Appl Comp, I-00185 Rome, Italy
Source
SCIENTIFIC REPORTS | 2025, Vol. 15, No. 1
Keywords
Open protein language models (PLMs); Protein crystallization; Benchmarking; Protein generation; PROPENSITY PREDICTION; REFINEMENT;
DOI
10.1038/s41598-025-86519-5
CLC Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to predict the crystallization propensity of a protein from its sequence, in order to overcome the high attrition rate, cost, and extensive trial and error of crystallization experiments. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework democratizing the usage of PLMs, on the task of predicting the crystallization propensities of proteins. By comparing LightGBM and XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs (ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, and SaProt) against state-of-the-art sequence-based methods such as DeepCrystal, ATTCrys, and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers utilizing embeddings from the ESM2 models with 30 and 36 transformer layers (150 million and 3 billion parameters, respectively) outperform all compared models by 3-5% on independent test sets across several evaluation metrics, including AUPR (Area Under the Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve), and F1. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting from 3000 generated proteins and applying a series of filtration steps, including a consensus of all open PLM-based classifiers, sequence-identity reduction with CD-HIT, secondary-structure compatibility, aggregation screening, homology search, and foldability evaluation, we identify a set of 5 novel proteins as potentially crystallizable.
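As a rough illustration of the classifier setup the abstract describes, the sketch below mean-pools per-residue ESM2 embeddings into a single vector per protein and fits a LightGBM classifier on top. It is a minimal sketch, assuming the fair-esm and lightgbm Python packages rather than the TRILL platform the paper actually uses; the embed helper, toy sequences, and labels are hypothetical.

```python
# Minimal sketch: mean-pooled ESM2 embeddings + LightGBM classifier.
# Assumes `pip install fair-esm lightgbm`; the paper obtains embeddings via TRILL.
import torch
import esm
from lightgbm import LGBMClassifier

# ESM2 with 30 transformer layers / 150M parameters, one of the
# best-performing embedding models in the benchmark.
model, alphabet = esm.pretrained.esm2_t30_150M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def embed(sequences):
    """Return one mean-pooled embedding vector per input protein sequence."""
    data = [(f"seq{i}", s) for i, s in enumerate(sequences)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[30])
    reps = out["representations"][30]
    # Average over residue positions only, skipping the BOS token at
    # position 0 and any EOS/padding tokens after the sequence end.
    return [reps[i, 1 : len(s) + 1].mean(0).numpy() for i, (_, s) in enumerate(data)]

# Hypothetical toy data: two sequences with binary crystallization labels.
X = embed(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MGSSHHHHHHSSGLVPRGSH"])
y = [1, 0]
clf = LGBMClassifier(n_estimators=200).fit(X, y)
print(clf.predict_proba(X))  # column 1 holds the crystallizable-class probability
```

Mean pooling over residues is the simplest way to turn variable-length per-residue embeddings into the fixed-length vectors that gradient-boosted trees require; any of the other benchmarked PLMs could be substituted by swapping the model and its representation layer.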
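The generation pipeline's filtration steps are described only at a high level, so the following sketch shows just two of them: a consensus vote across the PLM-based classifiers and sequence-identity reduction with the CD-HIT command-line tool (assumed to be installed and on PATH). The helper names consensus_filter and cdhit_dedupe and the 90% identity threshold are illustrative choices, not the paper's exact settings.

```python
# Sketch of two filtration steps from the protein-generation pipeline.
import subprocess

def consensus_filter(sequences, classifiers, embedders):
    """Keep only sequences that every PLM-based classifier calls crystallizable."""
    kept = []
    for seq in sequences:
        votes = [clf.predict(emb([seq]))[0] for clf, emb in zip(classifiers, embedders)]
        if all(v == 1 for v in votes):
            kept.append(seq)
    return kept

def cdhit_dedupe(in_fasta, out_fasta, identity=0.9):
    """Cluster sequences with CD-HIT and keep one representative per cluster.

    -c is the identity threshold; -n 5 is the word size CD-HIT recommends
    for thresholds in the 0.7-1.0 range.
    """
    subprocess.run(
        ["cd-hit", "-i", in_fasta, "-o", out_fasta, "-c", str(identity), "-n", "5"],
        check=True,
    )
```

The remaining filters (secondary-structure compatibility, aggregation screening, homology search, and foldability evaluation) would slot in as further predicate functions applied to the surviving sequences.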
Pages: 17
Related Papers
50 records in total
  • [31] Protein crystallization
    Rosenberger, F
    JOURNAL OF CRYSTAL GROWTH, 1996, 166 (1-4) : 40 - 54
  • [32] Single-sequence protein structure prediction by integrating protein language models
    Jing, Xiaoyang
    Wu, Fandi
    Luo, Xiao
    Xu, Jinbo
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2024, 121 (13)
  • [33] Crystallization of protein-protein complexes
    Radaev, S
    Sun, PD
    JOURNAL OF APPLIED CRYSTALLOGRAPHY, 2002, 35 : 674 - 676
  • [34] Evaluation of kinetic models of seeded batch protein crystallization
    Carbone, MN
    Etzel, MR
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2004, 227 : U247 - U247
  • [35] Benchmarking DNA large language models on quadruplexes
    Cherednichenko, Oleksandr
    Herbert, Alan
    Poptsova, Maria
    COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2025, 27 : 992 - 1000
  • [36] Benchmarking AutoGen with different large language models
    Barbarroxa, Rafael
    Ribeiro, Bruno
    Gomes, Luis
    Vale, Zita
    2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024, : 263 - 264
  • [37] Benchmarking Large Language Models for News Summarization
    Zhang, Tianyi
    Ladhak, Faisal
    Durmus, Esin
    Liang, Percy
    Mckeown, Kathleen
    Hashimoto, Tatsunori B.
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 39 - 57
  • [38] Benchmarking Large Language Models: Opportunities and Challenges
    Hodak, Miro
    Ellison, David
    Van Buren, Chris
    Jiang, Xiaotong
    Dholakia, Ajay
    PERFORMANCE EVALUATION AND BENCHMARKING, TPCTC 2023, 2024, 14247 : 77 - 89
  • [39] Boosting Protein Language Models with Negative Sample Mining
    Xu, Yaoyao
    Zhao, Xinjian
    Song, Xiaozhuang
    Wang, Benyou
    Yu, Tianshu
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES-APPLIED DATA SCIENCE TRACK, PT X, ECML PKDD 2024, 2024, 14950 : 199 - 214
  • [40] Protein language models guide directed antibody evolution
    Arunima Singh
    NATURE METHODS, 2023, 20 : 785 - 785