Multi-class imbalanced big data classification on Spark

Cited by: 50
Authors
Sleeman, William C. [1]
Krawczyk, Bartosz [1]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
Keywords
Machine learning; Big data; Imbalanced data classification; Multi-class imbalance; Spark; MapReduce; DECISION TREE; MAPREDUCE; SELECTION; ENSEMBLE; INFORMATION; ALGORITHMS; IMPROVE; SMOTE;
DOI
10.1016/j.knosys.2020.106598
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the impact of class skew are no longer feasible due to the volume of datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where the majority and minority classes are well defined. Multi-class imbalanced data poses further challenges, as the relationships between classes are much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to an understanding of what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes the spatial limitations of its predecessor in distributed environments. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced. (C) 2020 Elsevier B.V. All rights reserved.
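The instance-level analysis mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation. It assumes a neighborhood-based taxonomy often used in the imbalanced-learning literature (safe, borderline, rare, outlier, decided by how many of an instance's 5 nearest neighbors share its class); the abstract does not state these thresholds, and the function name `instance_types` is hypothetical.

```python
from math import dist


def instance_types(points, labels, k=5):
    """Tag each instance by the number of same-class points among its k nearest neighbours.

    Assumed thresholds for k=5 (not taken from the abstract):
    4-5 same-class neighbours -> safe, 2-3 -> borderline, 1 -> rare, 0 -> outlier.
    """
    types = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Brute-force neighbour search, excluding the instance itself.
        neighbours = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        same = sum(1 for _, j in neighbours[:k] if labels[j] == y)
        if same >= 4:
            types.append("safe")
        elif same >= 2:
            types.append("borderline")
        elif same == 1:
            types.append("rare")
        else:
            types.append("outlier")
    return types


# Toy data: six tightly grouped points of class "a" and one isolated point of class "b".
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (10, 10)]
labels = ["a", "a", "a", "a", "a", "a", "b"]
print(instance_types(points, labels))
```

A resampler could then, for example, oversample or clean only instances of selected types in each class. The brute-force neighbor search here is quadratic in the number of instances and would need a distributed or approximate replacement at big-data scale.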
Pages: 15