A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models

Cited by: 6
Authors
Zheng, Ming [1, 2]
Wang, Fei [1]
Hu, Xiaowen [1]
Miao, Yuhao [3]
Cao, Huo [1]
Tang, Mingjing [4, 5]
Affiliations
[1] Anhui Normal Univ, Sch Comp & Informat, Wuhu 241002, Peoples R China
[2] Anhui Prov Key Lab Network & Informat Secur, Wuhu 241002, Peoples R China
[3] Anhui Normal Univ, Affiliated Inst, Wuhu 241002, Peoples R China
[4] Yunnan Normal Univ, Sch Life Sci, Kunming 650500, Yunnan, Peoples R China
[5] Yunnan Normal Univ, Engn Res Ctr Sustainable Dev & Utilizat Biomass E, Minist Educ, Kunming 650500, Yunnan, Peoples R China
Keywords
machine learning models; imbalanced data; machine learning; data mining; performance impact; neural network; SMOTE; classifiers
DOI
10.3390/axioms11110607
Chinese Library Classification (CLC)
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Machine learning models may not learn from and predict on imbalanced data effectively. This study proposes a method for analyzing the performance impact of imbalanced binary data on machine learning models. It systematically analyzes (1) the relationship between model performance and the imbalance rate (IR), and (2) the performance stability of machine learning models on imbalanced binary data. In the proposed method, imbalanced data augmentation algorithms are first designed to obtain imbalanced datasets with gradually varying IR. Then, to obtain more objective classification results, the evaluation metric AFG, the arithmetic mean of the area under the receiver operating characteristic curve (AUC), F-measure and G-mean, is used to evaluate the classification performance of the models. Finally, a performance stability evaluation method based on AFG and the coefficient of variation (CV) is proposed. Experiments with eight widely used machine learning models on 48 different imbalanced datasets show that classification performance decreases as IR increases on the same imbalanced data. The classification performance of LR, DT and SVC is unstable, while GNB, BNB, KNN, RF and GBDT are relatively stable and not susceptible to imbalanced data. In particular, BNB has the most stable classification performance. The Friedman and Nemenyi post hoc statistical tests confirm this result. The oversampling-based imbalanced data augmentation uses the SMOTE method, and whether other oversampling methods yield consistent results needs further research. Future work should use imbalanced data augmentation algorithms based on undersampling and hybrid sampling to analyze the performance impact of imbalanced binary data on machine learning models.
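The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only states that AFG is the arithmetic mean of AUC, F-measure and G-mean, and that stability is judged from AFG and CV. The sketch assumes that IR is the majority-to-minority count ratio, that the minority class is labeled 1, that F-measure means F1, that G-mean is the square root of sensitivity times specificity, and that CV uses the population standard deviation. All function names and the 0.5 threshold are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def imbalance_rate(y_true):
    """IR as majority count / minority count (assumed definition)."""
    pos = sum(1 for v in y_true if v == 1)
    neg = len(y_true) - pos
    return max(pos, neg) / min(pos, neg)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure_and_g_mean(y_true, y_pred):
    """Return (F1, G-mean) computed from the binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, sqrt(recall * specificity)


def afg(y_true, scores, threshold=0.5):
    """AFG: arithmetic mean of AUC, F-measure and G-mean."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    f1, g_mean = f_measure_and_g_mean(y_true, y_pred)
    return (auc(y_true, scores) + f1 + g_mean) / 3


def coefficient_of_variation(values):
    """CV = standard deviation / mean (population std assumed)."""
    return pstdev(values) / mean(values)


if __name__ == "__main__":
    # Toy labels and scores, made up for illustration only.
    y = [1, 1, 0, 0, 0, 0, 0, 0]
    s = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.2, 0.1]
    print("IR :", imbalance_rate(y))
    print("AFG:", round(afg(y, s), 4))
    # Hypothetical AFG values of one model across increasing IR.
    print("CV :", round(coefficient_of_variation([0.82, 0.79, 0.75, 0.70]), 4))
```

Under these assumptions, a lower CV of a model's AFG values across datasets with varying IR would indicate more stable performance, which matches how the abstract uses CV.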
Pages: 19