Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data

Cited by: 76
Authors
Ebenuwa, Solomon H. [1]
Sharif, Mhd Saeed [1]
Alazab, Mamoun [2]
Al-Nemrat, Ameer [1]
Affiliations
[1] Univ East London, Sch Architecture Comp & Engn, London E16 2RD, England
[2] Charles Darwin Univ, Coll Engn IT & Environm, Casuarina, NT 0810, Australia
Keywords
Imbalanced dataset; class distribution; binary class; imbalance ratio; majority class; minority class; oversampling; undersampling; logistic regression; support vector machine; decision tree; ranked order similarity; peak threshold accuracy; PREDICTION; DISCRETE
DOI
10.1109/ACCESS.2019.2899578
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data are being generated and used to support all aspects of healthcare provision, from policy formation to the delivery of primary care services. In particular, with the shift of emphasis from curative to preventive medicine, the growing importance of data-based research such as data mining and machine learning has highlighted the issue of class distribution in datasets. In typical predictive modeling, the inability to effectively address class imbalance in a real-life dataset is an important shortcoming of existing machine learning algorithms. Most algorithms assume balanced classes in their design, resulting in poor performance in predicting the minority target class. Ironically, the minority target class is usually the focus of the prediction process. Misclassification of the minority target class has serious consequences in detecting chronic diseases and in detecting fraud and intrusion, where positive cases are erroneously predicted as not positive. This paper presents a new attribute selection technique called variance ranking for handling class imbalance problems in a dataset. The results obtained were compared to two well-known attribute selection techniques: Pearson correlation and information gain. This paper uses a novel similarity measurement technique, ranked order similarity (ROS), to evaluate the variance ranking attribute selection against Pearson correlation and information gain. Further validation was carried out using three binary classifiers: logistic regression, support vector machine, and decision tree. The proposed variance ranking and ranked order similarity techniques showed better results than the benchmarks. The ROS technique provided an excellent means of grading and measuring similarities where other similarity measurement techniques were inadequate or not applicable.
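The abstract does not spell out how the variance ranking score is computed, so the following is only an illustrative sketch of the general idea of ranking attributes by a variance-based score and comparing that order with a Pearson correlation ranking. The scoring rule used here (absolute difference between the per-class variances of a min-max scaled attribute), the function names `variance_gap`, `pearson_abs`, and `rank`, and the toy data are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pvariance


def variance_gap(values, labels):
    """Hypothetical variance-based score (an assumption, not the paper's formula):
    absolute difference between the minority-class and majority-class variances
    of an attribute after min-max scaling to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant attributes
    scaled = [(v - lo) / span for v in values]
    pos = [s for s, y in zip(scaled, labels) if y == 1]
    neg = [s for s, y in zip(scaled, labels) if y == 0]
    return abs(pvariance(pos) - pvariance(neg))


def pearson_abs(values, labels):
    """Absolute Pearson correlation between an attribute and the binary label."""
    mx, my = mean(values), mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(values, labels))
    sx = sum((x - mx) ** 2 for x in values) ** 0.5
    sy = sum((y - my) ** 2 for y in labels) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def rank(data, labels, score):
    """Return attribute names ordered from highest to lowest score."""
    return sorted(data, key=lambda name: score(data[name], labels), reverse=True)


# Toy imbalanced dataset: six majority (0) and two minority (1) samples.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
data = {
    "a": [1, 1, 1, 1, 1, 1, 5, 9],
    "b": [3, 4, 3, 4, 3, 4, 3, 4],
    "c": [0, 10, 0, 10, 0, 10, 5, 5],
}

variance_order = rank(data, labels, variance_gap)
pearson_order = rank(data, labels, pearson_abs)
print("variance-based order:", variance_order)
print("Pearson-based order: ", pearson_order)
```

On this toy data the two rankings disagree about attribute `c`, whose spread differs between the classes while its linear correlation with the label is zero. Quantifying how far two such rankings agree is the kind of comparison the abstract attributes to ROS, which is not reproduced here.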
Pages: 24649 / 24666
Page count: 18
Related papers
50 in total
  • [31] Using classification techniques to improve replica selection in data grid
    Jin, Hai
    Huang, Jin
    Xie, Xia
    Zhang, Qin
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2006: COOPIS, DOA, GADA, AND ODBASE PT 2, PROCEEDINGS, 2006, 4276 : 1376 - 1387
  • [32] A two stage grading approach for feature selection and classification of microarray data using Pareto based feature ranking techniques: A case study
    Dash, Rasmita
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2020, 32 (02) : 232 - 247
  • [33] Correction to: An empirical study on the joint impact of feature selection and data resampling on imbalance classification
    Chongsheng Zhang
    Paolo Soda
    Jingjun Bi
    Gaojuan Fan
    George Almpanidis
    Salvador García
    Weiping Ding
    Applied Intelligence, 2023, 53 : 8506 - 8506
  • [34] Feature Selection for Data Classification based on Binary Brain Storm Optimization
    Pourpanah, Farhad
    Wang, Ran
    Wang, Xizhao
    PROCEEDINGS OF 2019 6TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2019 : 108 - 113
  • [35] Binary black hole algorithm for feature selection and classification on biological data
    Pashaei, Elnaz
    Aydin, Nizamettin
    APPLIED SOFT COMPUTING, 2017, 56 : 94 - 106
  • [36] An improved binary sparrow search algorithm for feature selection in data classification
    Gad, Ahmed G.
    Sallam, Karam M.
    Chakrabortty, Ripon K.
    Ryan, Michael J.
    Abohany, Amr A.
    NEURAL COMPUTING & APPLICATIONS, 2022, 34 (18) : 15705 - 15752
  • [37] Gene Selection and Classification Approach for Microarray Data based on Random Forest Ranking and BBHA
    Pashaei, Elnaz
    Ozen, Mustafa
    Aydin, Nizamettin
    2016 3RD IEEE EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS, 2016 : 308 - 311
  • [38] Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data
    Nadeem, Khurram
    Jabri, Mehdi-Abderrahman
    PLOS ONE, 2023, 18 (01)
  • [39] A survey on effects of class imbalance in data pre-processing stage of classification problem
    Malave, Nitin
    Nimkar, Anant V.
    International Journal of Computational Systems Engineering, 2020, 6 (02) : 63 - 75
  • [40] Generative adversarial network augmentation for solving the training data imbalance problem in crop classification
    Shumilo, Leonid
    Okhrimenko, Anton
    Kussul, Nataliia
    Drozd, Sofiia
    Shkalikov, Oleh
    REMOTE SENSING LETTERS, 2023, 14 (11) : 1131 - 1140