KNN-based dynamic query-driven sample rescaling strategy for class imbalance learning

Cited by: 24
Authors
Hu, Jun [1]
Li, Yang [1]
Yan, Wu-Xia [1]
Yang, Jing-Yu [1]
Shen, Hong-Bin [2]
Yu, Dong-Jun [1]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Xiaolingwei 200, Nanjing 210094, Jiangsu, Peoples R China
[2] Shanghai Jiao Tong Univ, Inst Image Proc & Pattern Recognit, Dongchuan Rd 800, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Imbalanced learning; Sample rescaling; Dynamic query-driven sample rescaling; Classifier ensemble; Protein-nucleotide binding residues prediction; ATP BINDING RESIDUES; PROTEIN; PREDICTION; SEQUENCE; SITES; ENSEMBLE; IDENTIFICATION
DOI
10.1016/j.neucom.2016.01.043
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The class imbalance phenomenon is pervasive in bioinformatics prediction problems, in which the number of majority samples is significantly larger than that of minority samples. Relieving the severity of class imbalance has been demonstrated to be a promising route for enhancing the prediction performance of a statistical machine learning-based predictor under an imbalanced learning scenario. In this study, we propose a novel dynamic query-driven sample rescaling (DQD-SR) strategy for addressing class imbalance. Unlike the traditional sample rescaling technique, which often yields a fixed balanced dataset, the proposed DQD-SR dynamically generates a query-driven balanced dataset based on the KNN algorithm. A prediction model trained on a traditional sample rescaling (T-SR)-derived balanced dataset will partially learn the global knowledge buried in the original dataset, whereas a prediction model trained on DQD-SR will reflect the query-specific local knowledge between a query sample and its correlated neighbors in the original dataset. Thus, we developed an ensemble scheme to integrate the T-SR-based model and the DQD-SR-based model to further improve the overall prediction performance. To demonstrate the efficacy of the proposed method, we performed stringent cross-validation and independent validation tests on benchmark datasets concerning protein-nucleotide binding residues prediction, which is a typical imbalanced learning problem in bioinformatics. Experimental results show that the proposed method achieves high prediction performance and outperforms existing sequence-based protein-nucleotide binding residues predictors. We also implemented a predictor called TargetNUCs, which is freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetNUCs. (C) 2016 Elsevier B.V. All rights reserved.
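The contrast the abstract draws between T-SR and DQD-SR can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the neighbor-selection rule, the base classifier, or the ensemble weights, so the choices below (keeping the k nearest samples of each class, an inverse-distance score standing in for a trained model, and a fixed mixing weight) are assumptions, and all function and parameter names are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def traditional_rescale(majority, minority, seed=0):
    """T-SR style: randomly undersample the majority class once.

    The result is a fixed balanced set that does not depend on the query.
    """
    rng = random.Random(seed)
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return kept, list(minority)


def query_driven_rescale(query, majority, minority, k):
    """DQD-SR style (assumed rule): keep the k nearest samples of each class.

    The balanced set is rebuilt for every query, so it reflects the
    neighborhood of that query rather than the whole dataset.
    """
    k = min(k, len(majority), len(minority))

    def nearest(pool):
        return sorted(pool, key=lambda s: euclidean(query, s))[:k]

    return nearest(majority), nearest(minority)


def minority_score(query, negatives, positives):
    """Toy stand-in for a trained classifier (assumption, not from the paper).

    Returns the inverse-distance-weighted share of minority samples, in [0, 1].
    """
    eps = 1e-9
    w_pos = sum(1.0 / (euclidean(query, s) + eps) for s in positives)
    w_neg = sum(1.0 / (euclidean(query, s) + eps) for s in negatives)
    return w_pos / (w_pos + w_neg)


def ensemble_score(query, majority, minority, k=3, weight=0.5, seed=0):
    """Blend the global (T-SR) and local (DQD-SR) scores with a fixed weight."""
    global_score = minority_score(query, *traditional_rescale(majority, minority, seed))
    local_score = minority_score(query, *query_driven_rescale(query, majority, minority, k))
    return weight * global_score + (1.0 - weight) * local_score
```

In this sketch `traditional_rescale` returns the same balanced set for every query, whereas `query_driven_rescale` returns a different one per query; `ensemble_score` combines a score from each, mirroring the global-plus-local idea described in the abstract.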
Pages: 363-373
Number of pages: 11
Related papers
6 in total
  • [1] pQLyCar: Peptide-based dynamic query-driven sample rescaling strategy for identifying carboxylation sites combined with KNN and SVM
    Ning, Qiao
    Deng, Ansheng
    Zou, Tingting
    Zhao, Xiaowei
    ANALYTICAL BIOCHEMISTRY, 2021, 633
  • [2] KNN-based ensemble selection for imbalance learning
    Zheng, Guirong
    Wu, Chang-An
    Guo, Huaping
    International Journal of Computational Systems Engineering, 2019, 5 (02): 82-96
  • [3] An evolution-based DNA-binding residue predictor using a dynamic query-driven learning scheme
    Chai, H.
    Zhang, J.
    Yang, G.
    Ma, Z.
    MOLECULAR BIOSYSTEMS, 2016, 12 (12): 3643-3650
  • [4] Constructing Query-Driven Dynamic Machine Learning Model With Application to Protein-Ligand Binding Sites Prediction
    Yu, Dong-Jun
    Hu, Jun
    Li, Qian-Mu
    Tang, Zhen-Min
    Yang, Jing-Yu
    Shen, Hong-Bin
    IEEE TRANSACTIONS ON NANOBIOSCIENCE, 2015, 14 (01): 45-58
  • [5] Skin lesion image retrieval using transfer learning-based approach for query-driven distance recommendation
    Barhoumi, Walid
    Khelifa, Afifa
    COMPUTERS IN BIOLOGY AND MEDICINE, 2021, 137
  • [6] Dynamic Data-Driven Carbon-Based Electric Vehicle Charging Pricing Strategy Using Machine Learning
    Garrido, Jacqueline
    Barth, Matthew J.
    Enriquez-Contreras, Luis
    Hasan, Asm Jahid
    Todd, Michael
    Ula, Sadrul
    Yusuf, Jubair
    2021 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), 2021: 1670-1676