HOTGpred: Enhancing human O-linked threonine glycosylation prediction using integrated pretrained protein language model-based features and multi-stage feature selection approach

Cited by: 0
Authors
Pham N.T. [1]
Zhang Y. [2]
Rakkiyappan R. [3]
Manavalan B. [1]
Affiliations
[1] Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do
[2] Beidahuang Industry Group General Hospital, Harbin
[3] Department of Mathematics, Bharathiar University, Coimbatore, Tamil Nadu
Funding
National Research Foundation, Singapore
Keywords
Bioinformatics; Extreme gradient boosting; O-linked threonine glycosylation; Post-translational modification; Pretrained protein language model-based features; Two-step feature selection
DOI
10.1016/j.compbiomed.2024.108859
Abstract
O-linked glycosylation is a complex post-translational modification (PTM) in human proteins that plays a critical role in regulating various cellular metabolic and signaling pathways. In contrast to N-linked glycosylation, O-linked glycosylation lacks specific sequence features and maintains an unstable core structure. Identifying O-linked threonine glycosylation sites (OTGs) remains challenging, requiring extensive experimental tests. While bioinformatics tools have emerged for predicting OTGs, their reliance on limited conventional features and absence of well-defined feature selection strategies limit their effectiveness. To address these limitations, we introduced HOTGpred (Human O-linked Threonine Glycosylation predictor), employing a multi-stage feature selection process to identify the optimal feature set for accurately identifying OTGs. Initially, we assessed 25 different feature sets derived from various pretrained protein language model (PLM)-based embeddings and conventional feature descriptors using nine classifiers. Subsequently, we integrated the top five embeddings linearly and determined the most effective scoring function for ranking hybrid features, identifying the optimal feature set through a process of sequential forward search. Among the classifiers, the extreme gradient boosting (XGBT)-based model, using the optimal feature set (HOTGpred), achieved 92.03% accuracy on the training dataset and 88.25% on the balanced independent dataset. Notably, HOTGpred significantly outperformed the current state-of-the-art methods on both the balanced and imbalanced independent datasets, demonstrating its superior prediction capabilities. Additionally, SHapley Additive exPlanations (SHAP) and ablation analyses were conducted to identify the features contributing most significantly to HOTGpred.
Finally, we developed an easy-to-navigate web server, accessible at https://balalab-skku.org/HOTGpred/, to support glycobiologists in their research on glycosylation structure and function. © 2024 Elsevier Ltd
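The feature selection steps summarized in the abstract (linear integration of embeddings, ranking of the hybrid features by a scoring function, then a sequential forward search) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, a Fisher-style score stands in for the unspecified scoring function, a nearest-centroid classifier stands in for XGBT, and the forward search is assumed to add features in ranked order. All function and variable names here are hypothetical.

```python
import random
from statistics import mean, pvariance


def fisher_score(column, labels):
    """Squared class-mean gap divided by within-class variance (assumed scoring function)."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / (spread + 1e-12)


def centroid_accuracy(X_train, y_train, X_val, y_val, cols):
    """Accuracy of a nearest-centroid classifier using only `cols` (stand-in for XGBT)."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [mean(r[c] for r in rows) for c in cols]
    correct = 0
    for x, y in zip(X_val, y_val):
        dist = {lab: sum((x[c] - m) ** 2 for c, m in zip(cols, cen))
                for lab, cen in centroids.items()}
        correct += min(dist, key=dist.get) == y
    return correct / len(y_val)


def select_features(X_train, y_train, X_val, y_val):
    """Rank all features, then grow the subset in ranked order and keep the best prefix."""
    n_features = len(X_train[0])
    scores = [fisher_score([row[j] for row in X_train], y_train)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    best_cols, best_acc = [], 0.0
    for k in range(1, n_features + 1):
        cols = ranked[:k]
        acc = centroid_accuracy(X_train, y_train, X_val, y_val, cols)
        if acc > best_acc:  # keep the smallest prefix reaching the best accuracy
            best_cols, best_acc = cols, acc
    return ranked, best_cols, best_acc


# Synthetic stand-ins for two embedding blocks: block_a is informative, block_b is noise.
random.seed(0)
labels = [i % 2 for i in range(200)]
block_a = [[random.gauss(2.0 * y, 1.0) for _ in range(3)] for y in labels]
block_b = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in labels]

# "Linear integration" is modeled here as per-sample concatenation of the blocks.
X = [a + b for a, b in zip(block_a, block_b)]
X_train, y_train = X[:100], labels[:100]
X_val, y_val = X[100:], labels[100:]

ranked, best_cols, best_acc = select_features(X_train, y_train, X_val, y_val)
print("feature ranking:", ranked)
print("selected features:", best_cols)
print("validation accuracy:", best_acc)
```

In practice the score, the classifier, and the evaluation protocol (for example cross-validation rather than a single held-out split) would all be replaced by whatever the study actually used; the sketch only shows how ranking and an incremental search fit together.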
Related papers
3 in total
  • [1] O-GlyThr: Prediction of human O-linked threonine glycosites using multi-feature fusion
    Tang, Hua
    Tang, Qiang
    Zhang, Qian
    Feng, Pengmian
    International Journal of Biological Macromolecules, 2023, 242
  • [2] Prediction of human O-linked glycosylation sites using stacked generalization and embeddings from pre-trained protein language model
    Pakhrin, Subash Chandra
    Chauhan, Neha
    Khan, Salman
    Upadhyaya, Jamie
    Beck, Moriah Rene
    Blanco, Eduardo
    Bioinformatics, 2024, 40 (11)
  • [3] Independent Component Analysis-Based Prediction of O-Linked Glycosylation Sites in Protein Using Multi-Layered Neural Networks
    Wang, Chu-Zheng
    Tan, Xiao-Feng
    Chen, Yen-Wei
    Han, Xian-Hua
    Ito, Masahiro
    Nishikawa, Ikuko
    2010 IEEE 10th International Conference on Signal Processing Proceedings (ICSP2010), Vols I-III, 2010, 1761-+