Data imbalance in classification: Experimental evaluation

Cited by: 400
Authors
Thabtah, Fadi [1]
Hammoud, Suhel [2]
Kamalov, Firuz [3]
Gonsalves, Amanda [1]
Affiliations
[1] Manukau Inst Technol, Corner Manukau Stn Rd,Davies Ave, Auckland 2104, New Zealand
[2] Univ Kalamoon, Deir Atiyah An Nabek Dist Rif Dimashq Governorate, Deir Atiyah, Syria
[3] Canadian Univ Dubai, Sheikh Zayed Rd, Dubai, U Arab Emirates
Keywords
Classification; Class imbalance; Data analysis; Machine learning; Statistical analysis; Supervised learning; FEATURES;
DOI
10.1016/j.ins.2019.11.004
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The advent of Big Data has ushered in a new era of scientific breakthroughs. One of the common issues that affects raw data is the class imbalance problem, which refers to an imbalanced distribution of the values of the response variable. This issue is present in fraud detection, network intrusion detection, medical diagnostics, and a number of other fields where negatively labeled instances significantly outnumber positively labeled instances. Modern machine learning techniques struggle to deal with imbalanced data because they focus on minimizing the error rate for the majority class while ignoring the minority class. The goal of our paper is to demonstrate the effects of class imbalance on classification models. Concretely, we study the impact of varying class imbalance ratios on classifier accuracy. By highlighting the precise nature of the relationship between the degree of class imbalance and the corresponding effects on classifier performance, we hope to help researchers better tackle the problem. To this end, we carry out extensive experiments using 10-fold cross-validation on a large number of datasets. In particular, we determine that the relationship between the class imbalance ratio and the accuracy is convex. (C) 2019 Elsevier Inc. All rights reserved.
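The abstract's setup (varying the class ratio and measuring accuracy under 10-fold cross-validation) can be illustrated with a small sketch. This is not the authors' experimental code: the synthetic labels, the helper names (`make_labels`, `stratified_folds`, `majority_baseline_cv_accuracy`), and the choice of a trivial always-predict-the-majority classifier are all assumptions made for illustration. It only shows how accuracy alone can look strong on skewed data even when the minority class is ignored entirely.

```python
import random
from collections import Counter


def make_labels(n, positive_fraction, seed=0):
    """Synthetic binary labels (hypothetical data): 1 = positive, 0 = negative."""
    n_pos = round(n * positive_fraction)
    labels = [1] * n_pos + [0] * (n - n_pos)
    random.Random(seed).shuffle(labels)
    return labels


def stratified_folds(labels, k=10):
    """Split indices into k folds, dealing each class out round-robin
    so every fold keeps roughly the overall class ratio."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        class_indices = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds


def majority_baseline_cv_accuracy(labels, k=10):
    """Mean k-fold accuracy of a baseline that always predicts
    the most frequent class seen in the training folds."""
    scores = []
    for test_indices in stratified_folds(labels, k):
        held_out = set(test_indices)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        predicted = Counter(train_labels).most_common(1)[0][0]
        correct = sum(1 for i in test_indices if labels[i] == predicted)
        scores.append(correct / len(test_indices))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for fraction in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        accuracy = majority_baseline_cv_accuracy(make_labels(1000, fraction))
        print(f"positive fraction {fraction:.1f}: baseline accuracy {accuracy:.3f}")
```

In this toy setting the baseline's accuracy is lowest at a 50/50 split and rises toward either extreme, while it never identifies a single minority instance. A real study would replace the baseline with trained classifiers and real datasets.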
Pages: 429-441
Number of pages: 13
Related papers
50 in total
  • [31] Performance analysis of data resampling on class imbalance and classification techniques on multi-omics data for cancer classification
    Yang, Yuting
    Mirzaei, Golrokh
    PLOS ONE, 2024, 19 (02):
  • [32] SPECTRAL DERIVATIVE FEATURES FOR SUPERVISED CLASSIFICATION OF REMOTE SENSING DATA: AN EXPERIMENTAL EVALUATION
    Bao, Jiangfeng
    Chi, Mingmin
    Benediktsson, Jon Atli
    2012 4TH WORKSHOP ON HYPERSPECTRAL IMAGE AND SIGNAL PROCESSING (WHISPERS), 2012,
  • [33] An Experimental Evaluation of Data Classification Models for Credibility Based Fake News Detection
    Ramkissoon, Amit Neil
    Mohammed, Shareeda
    20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2020), 2020, : 93 - 100
  • [34] Hybrid Firefly Optimised Ensemble Classification for Drifting Data Streams with Imbalance
    Pepsi, M. Blessa Binolin
    Kumar, N. Senthil
    KNOWLEDGE-BASED SYSTEMS, 2024, 288
  • [35] Online Classification Algorithm for Concept Drift and Class Imbalance Data Stream
    Lu K.-Z.
    Chen C.-F.
    Cai H.
    Wu D.-M.
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2022, 50 (03): : 585 - 597
  • [36] Using deep learning to predict user rating on imbalance classification data
    Hendry
    Chen, Rung-Ching
    IAENG International Journal of Computer Science, 2019, 46 (01)
  • [37] Benchmarking binary classification models on data sets with different degrees of imbalance
    Zhou, Ligang
    Lai, Kin Keung
    FRONTIERS OF COMPUTER SCIENCE IN CHINA, 2009, 3 (02): : 205 - 216
  • [38] Sequential targeting: A continual learning approach for data imbalance in text classification
    Jang, Joel
    Kim, Yoonjeon
    Choi, Kyoungho
    Suh, Sungho
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 179
  • [39] A Comprehensive Data Imbalance Analysis for Covid-19 Classification Dataset
    Tissir, Zineb
    Poudel, Sahadev
    Baidya, Ranjai
    Lee, Sang-Woong
    12TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2021): BEYOND THE PANDEMIC ERA WITH ICT CONVERGENCE INNOVATION, 2021, : 20 - 24
  • [40] Benchmarking binary classification models on data sets with different degrees of imbalance
    Ligang Zhou
    Kin Keung Lai
    Frontiers of Computer Science in China, 2009, 3 : 205 - 216