Data imbalance in classification: Experimental evaluation

Cited by: 400
Authors
Thabtah, Fadi [1]
Hammoud, Suhel [2]
Kamalov, Firuz [3]
Gonsalves, Amanda [1]
Affiliations
[1] Manukau Inst Technol, Corner Manukau Stn Rd,Davies Ave, Auckland 2104, New Zealand
[2] Univ Kalamoon, Deir Atiyah An Nabek Dist Rif Dimashq Governorate, Deir Atiyah, Syria
[3] Canadian Univ Dubai, Sheikh Zayed Rd, Dubai, U Arab Emirates
Keywords
Classification; Class imbalance; Data analysis; Machine learning; Statistical analysis; Supervised learning; FEATURES;
DOI
10.1016/j.ins.2019.11.004
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
The advent of Big Data has ushered in a new era of scientific breakthroughs. One of the common issues that affects raw data is the class imbalance problem, which refers to an imbalanced distribution of the values of the response variable. This issue is present in fraud detection, network intrusion detection, medical diagnostics, and a number of other fields where negatively labeled instances significantly outnumber positively labeled instances. Modern machine learning techniques struggle with imbalanced data because they focus on minimizing the error rate for the majority class while ignoring the minority class. The goal of our paper is to demonstrate the effects of class imbalance on classification models. Concretely, we study the impact of varying class imbalance ratios on classifier accuracy. By highlighting the precise nature of the relationship between the degree of class imbalance and the corresponding effects on classifier performance, we hope to help researchers better tackle the problem. To this end, we carry out extensive experiments using 10-fold cross validation on a large number of datasets. In particular, we determine that the relationship between the class imbalance ratio and the accuracy is convex. © 2019 Elsevier Inc. All rights reserved.
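The kind of experiment the abstract describes (vary the class ratio, then measure accuracy under 10-fold cross validation) can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the synthetic labels, the helper names (`make_labels`, `stratified_folds`, `majority_baseline_cv_accuracy`), and the choice of a trivial majority-class predictor are all assumptions made here. The baseline is useful because it shows how accuracy alone can look high on imbalanced data even when the minority class is never predicted.

```python
from collections import Counter


def make_labels(n, positive_fraction):
    """Build n synthetic binary labels with roughly the requested share of positives (1s)."""
    n_pos = round(n * positive_fraction)
    return [1] * n_pos + [0] * (n - n_pos)


def stratified_folds(labels, k=10):
    """Deal indices into k folds round-robin within each class so folds keep the class ratio."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def majority_baseline_cv_accuracy(labels, k=10):
    """Mean k-fold accuracy of a predictor that always outputs the training-set majority class."""
    scores = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        correct = sum(1 for i in test_idx if labels[i] == majority)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Sweep the share of positive instances (a hypothetical grid, chosen for illustration).
    for frac in (0.1, 0.3, 0.5, 0.7, 0.9):
        acc = majority_baseline_cv_accuracy(make_labels(1000, frac))
        print(f"positive fraction {frac:.1f}: baseline accuracy {acc:.3f}")
```

For this trivial predictor the accuracy equals the larger of the two class shares, so it is lowest for balanced data and rises toward either extreme, which is a convex curve. Whether a real classifier on real datasets behaves the same way is the empirical question the paper addresses; the sketch only shows why accuracy must be read against the class ratio.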
Pages: 429-441
Page count: 13
Related papers
50 in total
  • [21] A method for data imbalance in defective solar cell detection in Electroluminescence images based on experimental evaluation
    Jiang, Jiacheng
    Zhang, Kanjian
    Zhang, Jinxia
    2023 35TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2023, : 4138 - 4145
  • [22] Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification
    Islam, Maliha Tashfia
    Fariha, Anna
    Meliou, Alexandra
    Salimi, Babak
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22), 2022, : 232 - 246
  • [23] Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data
    Welvaars, Koen
    Oosterhoff, Jacobien H. F.
    van den Bekerom, Michel P. J.
    Doornberg, Job N.
    van Haarst, Ernst P.
    JAMIA OPEN, 2023, 6 (02)
  • [24] Imbalance Data Classification Algorithm based on SVM and Clustering Function
    Lin, Kai-Biao
    Weng, Wei
    Lai, Robert K.
    Lu, Ping
    2014 PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION (ICCSE 2014), 2014, : 544 - 548
  • [25] Selection of Augmented Data for Overcoming the Imbalance Problem in Facies Classification
    Kim, Dowan
    Byun, Joongmoo
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [26] Solving Data Imbalance in Text Classification With Constructing Contrastive Samples
    Chen, Xi
    Zhang, Wei
    Pan, Shuai
    Chen, Jiayin
    IEEE ACCESS, 2023, 11 : 90554 - 90562
  • [27] A Comprehensive Survey of Imbalance Correction Techniques for Hyperspectral Data Classification
    Paoletti, Mercedes E.
    Mogollon-Gutierrez, Oscar
    Moreno-Alvarez, Sergio
    Sancho, Jose Carlos
    Haut, Juan M.
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2023, 16 : 5297 - 5314
  • [28] Review of Random Forest Classification Techniques to Resolve Data Imbalance
    More, A. S.
    Rana, Dipti P.
    2017 1ST INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND INFORMATION MANAGEMENT (ICISIM), 2017, : 72 - 78
  • [29] A Novel Algorithm for Imbalance Data Classification Based on Neighborhood Hypergraph
    Hu, Feng
    Liu, Xiao
    Dai, Jin
    Yu, Hong
    SCIENTIFIC WORLD JOURNAL, 2014,
  • [30] Experimental Evaluation of Application Triggered Flow Classification Using Operated Data Center Traffic Data
    Murakami, Masaki
    Matsuno, Masahiro
    Okamoto, Satoru
    Yamanaka, Naoaki
    2019 24TH OPTOELECTRONICS AND COMMUNICATIONS CONFERENCE (OECC) AND 2019 INTERNATIONAL CONFERENCE ON PHOTONICS IN SWITCHING AND COMPUTING (PSC), 2019,