A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models

Cited by: 6
Authors
Zheng, Ming [1, 2]
Wang, Fei [1]
Hu, Xiaowen [1]
Miao, Yuhao [3]
Cao, Huo [1]
Tang, Mingjing [4, 5]
Affiliations
[1] Anhui Normal Univ, Sch Comp & Informat, Wuhu 241002, Peoples R China
[2] Anhui Prov Key Lab Network & Informat Secur, Wuhu 241002, Peoples R China
[3] Anhui Normal Univ, Affiliated Inst, Wuhu 241002, Peoples R China
[4] Yunnan Normal Univ, Sch Life Sci, Kunming 650500, Yunnan, Peoples R China
[5] Yunnan Normal Univ, Engn Res Ctr Sustainable Dev & Utilizat Biomass E, Minist Educ, Kunming 650500, Yunnan, Peoples R China
Keywords
machine learning models; imbalanced data; machine learning; data mining; performance impact; neural network; SMOTE; classifiers
DOI
10.3390/axioms11110607
Chinese Library Classification number
O29 [Applied Mathematics]
Discipline classification code
070104
Abstract
Machine learning models may fail to learn and predict effectively from imbalanced data. This study proposes a method for analyzing the performance impact of imbalanced binary data on machine learning models. It systematically analyzes (1) the relationship between the performance of machine learning models and the imbalance rate (IR), and (2) the performance stability of machine learning models on imbalanced binary data. In the proposed method, imbalanced data augmentation algorithms are first designed to obtain imbalanced datasets with gradually varying IR. Then, to obtain more objective classification results, the evaluation metric AFG, defined as the arithmetic mean of the area under the receiver operating characteristic curve (AUC), the F-measure, and the G-mean, is used to evaluate the classification performance of the models. Finally, a method for evaluating performance stability is proposed based on AFG and the coefficient of variation (CV). Experiments with eight widely used machine learning models on 48 different imbalanced datasets show that classification performance decreases as IR increases on the same imbalanced data. Meanwhile, the classification performance of LR, DT, and SVC is unstable, while GNB, BNB, KNN, RF, and GBDT are relatively stable and not susceptible to imbalanced data. In particular, BNB has the most stable classification performance. The Friedman and Nemenyi post hoc statistical tests confirm this result. SMOTE is the oversampling method used in the imbalanced data augmentation, and whether other oversampling methods yield consistent results needs further research. In the future, imbalanced data augmentation algorithms based on undersampling and hybrid sampling should be used to analyze the performance impact of imbalanced binary data on machine learning models.
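The AFG metric and the CV-based stability measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 decision threshold, the rank-based AUC computation, the majority/minority definition of IR, and the use of the population standard deviation in CV are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def imbalance_rate(y_true):
    """IR as majority count / minority count (assumed definition)."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    return max(pos, neg) / min(pos, neg)


def auc(y_true, scores):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def afg(y_true, scores, threshold=0.5):
    """Arithmetic mean of AUC, F-measure and G-mean (threshold is assumed)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0

    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return (auc(y_true, scores) + f_measure + g_mean) / 3


def coefficient_of_variation(values):
    """CV = standard deviation / mean (population std is an assumption)."""
    return pstdev(values) / mean(values)


# Toy data: the positive class (label 1) is the minority.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_ir = imbalance_rate(y_true)
toy_afg = afg(y_true, scores)
```

Under these assumptions, a model's stability would be summarized by calling `coefficient_of_variation` on the list of AFG values it obtains across datasets with different IR, with a lower CV indicating more stable performance.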
Pages: 19