Under Sampling Techniques for Handling Unbalanced Data with Various Imbalance Rates: A Comparative Study

Cited by: 0
Authors
Abu Elsoud, Esraa [1]
Hassan, Mohamad [1]
Alidmat, Omar [1]
Al Henawi, Esraa [1]
Alshdaifat, Nawaf [2]
Igtait, Mosab [3]
Ghaben, Ayman [4]
Katrawi, Anwar [3]
Dmour, Mohmmad [1]
Affiliations
[1] Zarqa Univ, Fac Informat Technol, Dept Comp Sci, Zarqa, Jordan
[2] Appl Sci Private Univ, Fac Informat Technol, Amman, Jordan
[3] Zarqa Univ, Dept Data Sci & Artificial Intelligence, Zarqa, Jordan
[4] Zarqa Univ, Fac Informat Technol, Dept Cyber Secur, Zarqa, Jordan
Keywords
Clusters centroid; decision tree; neighborhood cleaning rule; random under sampling; Tomek Link under sampling; unbalanced datasets; FEATURE-SELECTION;
DOI
10.14569/IJACSA.2024.01508124
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Unbalanced datasets contain unequal numbers of examples for different classes. Such datasets pose a problem for machine learning tools: when the imbalance ratio is high, the false negative rate increases because most classifiers are biased toward the majority class. Common ways to handle this problem are to choose the most informative evaluation metrics and to apply sampling techniques. This paper presents a comparative analysis of four of the most common under-sampling techniques over datasets whose imbalance rates (IR) range from low through medium to high. A Decision Tree classifier and twelve imbalanced datasets with various IR are used to evaluate the effect of each technique in terms of recall, F1-measure, G-mean, recall for the minority class, and F1-measure for the minority class. The results demonstrate that Clusters Centroid outperformed the Neighborhood Cleaning Rule (NCL) in terms of recall on all low-IR datasets. On both medium- and high-IR datasets, NCL and Random Under Sampling (RUS) outperformed the other techniques, while Tomek Link had the worst effect.
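As an illustration of the ideas named in the abstract, the following minimal Python sketch shows one way to compute an imbalance ratio, apply random under-sampling, and score the minority class. It is not the authors' implementation. The function names (`imbalance_ratio`, `random_under_sample`, `minority_metrics`) are hypothetical, and the IR definition used here (majority count divided by minority count) is an assumption, since the abstract does not define it. The Decision Tree classifier and the other three techniques are not reproduced.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_under_sample(samples, labels, seed=0):
    """Randomly keep as many examples of each class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(group) for group in by_class.values())
    out_x, out_y = [], []
    for y, group in by_class.items():
        # rng.sample draws `target` distinct examples without replacement.
        for x in rng.sample(group, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def minority_metrics(y_true, y_pred, minority):
    """Recall and F1 for the minority class, plus the two-class G-mean."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == minority and p == minority for t, p in pairs)
    fn = sum(t == minority and p != minority for t, p in pairs)
    fp = sum(t != minority and p == minority for t, p in pairs)
    tn = sum(t != minority and p != minority for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # G-mean: geometric mean of minority recall and majority recall (specificity).
    return {"recall": recall, "f1": f1, "gmean": math.sqrt(recall * specificity)}


# Toy data (invented for illustration): 90 majority vs. 10 minority examples.
toy_labels = [0] * 90 + [1] * 10
toy_samples = list(range(100))
balanced_x, balanced_y = random_under_sample(toy_samples, toy_labels)
```

On the toy data, `imbalance_ratio(toy_labels)` is 9.0 and `balanced_y` holds ten labels of each class. For real experiments, the imbalanced-learn package provides `RandomUnderSampler`, `ClusterCentroids`, `NeighbourhoodCleaningRule`, and `TomekLinks` in `imblearn.under_sampling`; whether the authors used that package is not stated in the abstract.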
Pages: 1274-1284
Page count: 11
Related papers
50 entries in total
  • [1] A Comparative Study on Sampling Techniques for Handling Class Imbalance in Streaming Data
    Nguyen, Hien M.
    Cooper, Eric W.
    Kamei, Katsuari
    [J]. 6TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND INTELLIGENT SYSTEMS, AND THE 13TH INTERNATIONAL SYMPOSIUM ON ADVANCED INTELLIGENT SYSTEMS, 2012, : 1762 - 1767
  • [2] Convolutional neural network applied to detect electricity theft: A comparative study on unbalanced data handling techniques
    Pereira, Jeanne
    Saraiva, Filipe
    [J]. INTERNATIONAL JOURNAL OF ELECTRICAL POWER & ENERGY SYSTEMS, 2021, 131
  • [3] A Comparative Analysis of Unbalanced Data Handling Techniques for Machine Learning Algorithms to Electricity Theft Detection
    Pereira, Jeanne
    Saraiva, Filipe
    [J]. 2020 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2020,
  • [4] Sampling strategies for handling data imbalance problem: An Extensive Review
    Veedhi, Bhaskar Kumar
    Mishra, Debahuti
    Das, Kaberi
    [J]. JOURNAL OF STATISTICS AND MANAGEMENT SYSTEMS, 2023, 26 (01) : 177 - 187
  • [5] A Comparative Study of Various Methods for Handling Missing Data in UNSODA
    Fu, Yingpeng
    Liao, Hongjian
    Lv, Longlong
    [J]. AGRICULTURE-BASEL, 2021, 11 (08):
  • [6] Comparative Study between Various Algorithms of Data Compression Techniques
    Al-Laham, Mohammed
    El Emary, Ibrahiem M. M.
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2007, 7 (04): : 281 - 291
  • [7] Comparative study between various algorithms of data compression techniques
    Al-laham, Mohammed
    El Emary, Ibrahiem M. M.
    [J]. WCECS 2007: WORLD CONGRESS ON ENGINEERING AND COMPUTER SCIENCE, 2007, : 326 - +
  • [8] A COMPARATIVE STUDY OF DATA SAMPLING TECHNIQUES FOR CONSTRUCTING NEURAL NETWORK ENSEMBLES
    Akhand, M. A. H.
    Islam, M. D. Monirul
    Murase, Kazuyuki
    [J]. INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2009, 19 (02) : 67 - 89
  • [9] Cluster Based Under-Sampling for Unbalanced Cardiovascular Data
    Rahman, M. Mostafizur
    Davis, D. N.
    [J]. WORLD CONGRESS ON ENGINEERING - WCE 2013, VOL III, 2013, : 1480 - 1485
  • [10] Comparative study of cardiovascular markers data by various techniques of multivariate analysis
    Balla, B
    Mocak, J
    Pivovarnikova, H
    Balla, J
    [J]. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2004, 72 (02) : 259 - 267