Machine learning for mining imbalanced data

Cited by: 0
Authors
Arafat, Md. Yasir [1]
Hoque, Sabera [2]
Xu, Shuxiang [3]
Farid, Dewan Md [4]
Affiliations
[1] Wipro Limited as a Technical Lead, India
[2] Computer Science and Engineering Department, United International University, Bangladesh
[3] School of Technology, Environments and Design, University of Tasmania, Australia
[4] United International University, Bangladesh
Keywords
Data mining; Adaptive boosting; Machine learning
DOI
Not available
Abstract
Mining imbalanced data, also known as the class imbalance problem, is one of the most challenging tasks in machine learning for data mining applications. Achieving accurate overall performance in imbalanced classification with machine learning techniques is difficult because majority class instances greatly outnumber minority class instances. An unequal class distribution is very common in real-world high-dimensional datasets, where binary classification is more frequent than multi-class classification. Most existing machine learning algorithms focus on classifying majority class instances while ignoring or misclassifying minority class instances. Several techniques have been introduced in recent decades for imbalanced data classification, each with its own advantages and disadvantages. In this paper, we study and compare 12 extensively used imbalanced data classification methods (SMOTE, AdaBoost, RUSBoost, EUSBoost, SMOTEBoost, MSMOTEBoost, DataBoost, Easy Ensemble, BalanceCascade, OverBagging, UnderBagging, and SMOTEBagging) to characterize their behavior and performance on 27 imbalanced datasets. In general, combining over-sampling and under-sampling techniques with ensemble classifiers such as bagging and boosting achieves the highest accuracy for classifying both majority and minority class instances. Additionally, an extensive and critical review of existing imbalanced learning algorithms is provided with detailed discussion. Based on our findings and the reviewed papers, we offer practical suggestions and further research directions for imbalanced learning. © International Association of Engineers.
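As a rough illustration of the over-sampling idea behind SMOTE, one of the methods named in the abstract, the following minimal sketch creates synthetic minority instances by interpolating between a minority point and one of its nearest minority neighbours. It is not the paper's implementation: the function name `smote_sketch`, its parameters, the default `k=3`, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Hypothetical helper: SMOTE-style interpolation over minority instances.

    minority    : list of equal-length numeric tuples (minority class only)
    n_synthetic : number of synthetic instances to create
    k           : number of nearest neighbours considered (assumed default)
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class (made-up values) enlarged with six synthetic points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6)
balanced_minority = minority + new_points
```

Because every synthetic point lies on a segment between two existing minority instances, the enlarged minority class stays inside the region the original instances already cover. In the ensemble variants listed in the abstract, a resampling step of this kind would typically be applied before training each base classifier.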
Pages: 332-348
Related papers
50 in total
  • [41] Scalability and efficiency in data mining and machine learning
    Meira, Wagner, Jr.
    2019 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2019, : 932 - 932
  • [42] Machine Learning Techniques for Data Mining: A Survey
    Sharma, Seema
    Agrawal, Jitendra
    Agarwal, Shikha
    Sharma, Sanjeev
    2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (ICCIC), 2013, : 162 - 167
  • [43] Data mining and machine learning in textile industry
    Yildirim, Pelin
    Birant, Derya
    Alpyildiz, Tuba
    WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2018, 8 (01)
  • [44] Fuzzy sets in machine learning and data mining
    Huellermeier, Eyke
    APPLIED SOFT COMPUTING, 2011, 11 (02) : 1493 - 1505
  • [45] Archetypal analysis for machine learning and data mining
    Morup, Morten
    Hansen, Lars Kai
    NEUROCOMPUTING, 2012, 80 : 54 - 63
  • [46] Business data mining - a machine learning perspective
    Bose, I
    Mahapatra, RK
    INFORMATION & MANAGEMENT, 2001, 39 (03) : 211 - 225
  • [47] Special issue data mining and machine learning
    Perner, Petra
    Vingerhoeds, Rob
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2009, 22 (01) : 1 - 2
  • [48] Explainable and interpretable machine learning and data mining
    Atzmueller, Martin
    Fuernkranz, Johannes
    Kliegr, Tomas
    Schmid, Ute
    DATA MINING AND KNOWLEDGE DISCOVERY, 2024, 38 (05) : 2571 - 2595
  • [49] Principles and Theory for Data Mining and Machine Learning
    Kleine, Liliana Lopez
    JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 2010, 173 : 691 - 692
  • [50] Machine learning and data mining for epidemic surveillance
    Kofod-Petersen, Anders
    MEDICAL JOURNAL OF AUSTRALIA, 2012, 196 (05) : 301 - 301