Prediction of protein-protein interaction sites through eXtreme gradient boosting with kernel principal component analysis

Cited by: 39
Authors
Wang, Xue [1,2]
Zhang, Yaqun [1,2]
Yu, Bin [1,2,3]
Salhi, Adil [4]
Chen, Ruixin [1,2]
Wang, Lin [1,2]
Liu, Zengfeng [1,2]
Affiliations
[1] Qingdao Univ Sci & Technol, Coll Math & Phys, Qingdao 266061, Peoples R China
[2] Qingdao Univ Sci & Technol, Artificial Intelligence & Biomed Big Data Res Ctr, Qingdao 266061, Peoples R China
[3] Sci Computat Lab, Applicat Hainan Prov, Haikou 571158, Hainan, Peoples R China
[4] King Abdullah Univ Sci & Technol KAUST, Computat Bioscience Res Ctr CBRC, Thuwal 23955, Saudi Arabia
Funding
National Natural Science Foundation of China;
Keywords
Protein-protein interaction sites; Feature extraction; SMOTE; KPCA; XGBoost; SEQUENCE-BASED PREDICTION; SECONDARY STRUCTURE; CLASSIFIER; IDENTIFICATION; LOCALIZATION; DESCRIPTORS; IDENTIFY; NETWORKS; EEG;
DOI
10.1016/j.compbiomed.2021.104516
Chinese Library Classification number
Q [Biological Sciences];
Subject classification code
07 ; 0710 ; 09 ;
Abstract
Predicting protein-protein interaction sites (PPI sites) can provide important clues for understanding biological activity. Using machine learning to predict PPI sites can reduce the need for expensive and time-consuming biological experiments. Here we propose PPISP-XGBoost, a novel PPI site prediction method based on eXtreme gradient boosting (XGBoost). First, the characteristic information of each protein is extracted through the pseudo-position specific scoring matrix (PsePSSM), pseudo-amino acid composition (PseAAC), hydropathy index and solvent accessible surface area (ASA) under a sliding window. Next, these raw features are preprocessed to obtain better representations and thereby better predictions. In particular, the synthetic minority oversampling technique (SMOTE) is used to address class imbalance, and kernel principal component analysis (KPCA) is applied to remove redundant features. Finally, the resulting features are fed to the XGBoost classifier to identify PPI sites. Using PPISP-XGBoost, the prediction accuracy on the training dataset Dset186 reaches 85.4%, and the accuracy on the independent validation datasets Dtestset72, PDBtestset164, Dset_448 and Dset_355 reaches 85.3%, 83.9%, 85.8% and 85.4%, respectively, all higher than the accuracies of existing PPI site prediction methods. These results demonstrate that the PPISP-XGBoost method can further enhance the prediction of PPI sites.
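The first two preprocessing ideas in the abstract, sliding-window feature extraction and SMOTE-style oversampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. Several things here are assumptions made only for illustration: the Kyte-Doolittle scale as the hydropathy index (the abstract does not name a scale), zero padding at the sequence ends, the window size, the toy sequence and labels, and the helper names `window_features` and `smote`. The PsePSSM, PseAAC and ASA features, the KPCA step and the XGBoost classifier are not shown.

```python
import random

# Kyte-Doolittle hydropathy scale. ASSUMPTION: the abstract only says
# "hydropathy index" and does not state which scale was used.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_features(sequence, window=9, pad=0.0):
    """Return one feature vector per residue: the hydropathy values of the
    residues in a window centred on it (use an odd window size).
    Positions beyond the sequence ends are filled with `pad` (an assumption)."""
    half = window // 2
    values = [HYDROPATHY.get(aa, pad) for aa in sequence]
    padded = [pad] * half + values + [pad] * half
    return [padded[i:i + window] for i in range(len(sequence))]


def smote(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic minority samples. Each one lies on the line
    segment between a randomly chosen minority sample and one of its `k`
    nearest minority neighbours (squared Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy example: hypothetical sequence and hypothetical interface labels.
sequence = "MKVLAAGIW"
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0]  # 1 = interaction site (made up)

features = window_features(sequence, window=5)
minority = [f for f, y in zip(features, labels) if y == 1]
majority = [f for f, y in zip(features, labels) if y == 0]

# Oversample the minority class until both classes have the same size.
extra = smote(minority, n_new=len(majority) - len(minority), k=1)
balanced_x = majority + minority + extra
balanced_y = [0] * len(majority) + [1] * (len(minority) + len(extra))
print(len(balanced_x), sum(balanced_y))
```

In practice the remaining steps would normally rely on existing libraries, for example `SMOTE` from imbalanced-learn, `KernelPCA` from scikit-learn and `XGBClassifier` from XGBoost. This record does not say which implementations the authors used.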
Pages: 13