Machine learning for mining imbalanced data

Cited by: 0

Authors:

Arafat, Md. Yasir [1]

Hoque, Sabera [2]

Xu, Shuxiang [3]

Farid, Dewan Md [4]

Affiliations:

[1] Wipro Limited as a Technical Lead, India

[2] Computer Science and Engineering Department, United International University, Bangladesh

[3] School of Technology, Environments and Design, University of Tasmania, Australia

[4] United International University, Bangladesh

Source:

IAENG International Journal of Computer Science | 2019, Vol. 46, No. 2

Keywords:

Data mining; Adaptive boosting; Machine learning

DOI:

Not available

Abstract:

Mining imbalanced data, also known as the class imbalance problem, is one of the most challenging tasks in machine learning for data mining applications. Achieving accurate overall performance in imbalanced classification with machine learning techniques is difficult because majority class instances greatly outnumber, and therefore overpower, minority class instances. Unequal class distributions are very common in real-world high-dimensional datasets, where binary classification is more frequent than multi-class classification. Most existing machine learning algorithms focus on classifying majority class instances while ignoring or misclassifying minority class instances. Several techniques for imbalanced data classification have been introduced in recent decades, each with its own advantages and disadvantages. In this paper, we study and compare 12 widely used imbalanced data classification methods (SMOTE, AdaBoost, RUSBoost, EUSBoost, SMOTEBoost, MSMOTEBoost, DataBoost, Easy Ensemble, BalanceCascade, OverBagging, UnderBagging, and SMOTEBagging) to characterize their behavior and performance on 27 imbalanced datasets. In general, combining over-sampling and under-sampling techniques with ensemble classifiers such as bagging and boosting achieves the highest accuracy for classifying both majority and minority class instances. Additionally, we provide an extensive and critical review of existing imbalanced learning algorithms, with detailed discussion. Based on our findings and the reviewed papers, we offer practical suggestions and directions for further research on imbalanced learning. © International Association of Engineers.
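As a rough illustration of the over-sampling idea named in the abstract, the following is a minimal SMOTE-style sketch in plain Python. It is not the authors' implementation or any library's API: the function name `smote_like_oversample`, its parameters, and the toy data are assumptions made up for this example. It only shows the core step of placing synthetic minority points on line segments between a minority point and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (hypothetical helper, illustrating the SMOTE idea only)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy data (made up for illustration): 3 minority points vs. 8 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
majority = [(5.0 + 0.1 * i, 5.0) for i in range(8)]
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), len(majority))  # prints "8 8"
```

In the ensemble variants the abstract lists (for example SMOTEBoost or SMOTEBagging), a resampling step of this kind would be applied to the training data seen by each base classifier; that wiring is omitted here.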

Pages: 332-348
