ILYCROsite: Identification of lysine crotonylation sites based on FCM-GRNN undersampling technique

Cited by: 0
Authors
Zuo, Yun [1]
Wan, Minquan [1]
Shen, Yang [1]
Wang, Xinheng [1]
He, Wenying [2]
Bi, Yue [3, 4]
Liu, Xiangrong [5]
Deng, Zhaohong [1]
Affiliations
[1] Jiangnan Univ, Sch Artificial Intelligence & Comp Sci, Wuxi 214000, Peoples R China
[2] Hebei Univ Technol, Sch Artificial Intelligence, Tianjin 300130, Peoples R China
[3] Monash Univ, Dept Biochem & Mol Biol, Clayton, Australia
[4] Monash Univ, Biomed Discovery Inst, Clayton, Australia
[5] Xiamen Univ, Natl Inst Data Sci Hlth & Med, Dept Comp Sci & Technol, Xiamen Key Lab Intelligent Storage & Comp, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Protein lysine crotonylation; Fully connected neural network; Imbalance data processing; Sequence analysis; PREDICTION; SEQUENCE; NETWORK;
DOI
10.1016/j.compbiolchem.2024.108212
Chinese Library Classification number
Q [Biological sciences]
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Protein lysine crotonylation is an important post-translational modification that regulates various cellular activities. For example, histone crotonylation affects chromatin structure and promotes histone replacement. Identifying and understanding lysine crotonylation sites is therefore crucial in protein research. However, as the number of known non-histone crotonylation sites grows, existing classifiers based on traditional machine learning may encounter performance limitations. To address this problem, and given the advantages of deep learning techniques for sequence data analysis, this study presents an MLP-Attention-based deep learning model for identifying crotonylation sites. First, three feature extraction strategies, namely Amino Acid Composition, K-mer, and a distance-based residue feature extraction strategy, were used to encode crotonylated and non-crotonylated sequences. Then, to balance the training dataset, the FCM-GRNN undersampling algorithm, which combines fuzzy clustering with a generalized regression neural network, was introduced. Finally, to improve the effectiveness of crotonylation site identification, we explored various classification algorithms and, based on experimental performance comparisons, selected a multilayer perceptron (MLP) combined with a superimposed self-attention mechanism to construct the prediction model ILYCROsite. Independent testing and five-fold cross-validation showed that ILYCROsite performs well. Notably, on the independent test set, ILYCROsite achieves an AUC of 87.93%, which is significantly better than existing state-of-the-art models.
In addition, SHAP (SHapley Additive exPlanations) values were used to analyze feature importance and the impact of features on model predictions. To make the prediction model easier for researchers to use, we also developed a prediction program that identifies crotonylation sites in a given protein sequence. The data and code for this program are available at: https://github.com/wmqskr/ILYCROsite.
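As an illustration of two of the encodings named in the abstract, Amino Acid Composition and K-mer, the following is a minimal Python sketch. It is not taken from the ILYCROsite repository: the function names, the 20-letter alphabet ordering, the default k = 2, and the choice to skip windows containing non-standard residues are all assumptions made for this example, and the paper's exact normalization may differ.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return a 20-dim vector of residue frequencies (hypothetical helper)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def kmer_frequencies(seq, k=2):
    """Return a 20**k-dim vector of overlapping k-mer frequencies.

    Windows containing characters outside AMINO_ACIDS (e.g. padding
    symbols) are skipped; this is an assumption of the sketch.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0.0] * len(kmers)
    n_valid = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1.0
            n_valid += 1
    if n_valid:
        vec = [v / n_valid for v in vec]
    return vec


def encode(seq, k=2):
    """Concatenate both encodings into one feature vector."""
    return amino_acid_composition(seq) + kmer_frequencies(seq, k)
```

For a peptide window centered on a candidate lysine, `encode(window)` would give a 420-dimensional vector (20 + 400) under these assumptions. Such vectors could then be passed to an undersampling step and a classifier.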
Pages: 12