Multi-class imbalanced big data classification on Spark

Cited by: 50
Authors
Sleeman, William C. [1]
Krawczyk, Bartosz [1]
Affiliation
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
Keywords
Machine learning; Big data; Imbalanced data classification; Multi-class imbalance; Spark; MapReduce; DECISION TREE; MAPREDUCE; SELECTION; ENSEMBLE; INFORMATION; ALGORITHMS; IMPROVE; SMOTE;
DOI
10.1016/j.knosys.2020.106598
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the impact of class skew are no longer feasible due to the volume of the datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where the majority and minority classes are well defined. Multi-class imbalanced data poses further challenges, as the relationships between classes are much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to an understanding of what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes the spatial limitations of its predecessor in distributed environments. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced. © 2020 Elsevier B.V. All rights reserved.
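The abstract does not say how instance-level difficulty is measured. As a rough illustration only, one approach common in the imbalanced-learning literature labels each instance by how many of its k nearest neighbors share its class. The sketch below assumes k = 5 and four category names (safe, borderline, rare, outlier); the thresholds, function names, and toy data are all assumptions made for this example, and the brute-force neighbor search is not the authors' Spark implementation.

```python
from math import dist


def neighbor_agreement(points, labels, k=5):
    """For each instance, count how many of its k nearest neighbors share its label."""
    counts = []
    for i, p in enumerate(points):
        # Brute-force search: acceptable for a toy set, not for big data.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        counts.append(sum(labels[j] == labels[i] for j in others[:k]))
    return counts


def difficulty_type(same_label_neighbors, k=5):
    """Map a neighbor count to an assumed difficulty category.

    Thresholds are illustrative: with k=5, 4-5 matching neighbors is "safe",
    2-3 is "borderline", 1 is "rare", and 0 is "outlier".
    """
    if 5 * same_label_neighbors >= 4 * k:
        return "safe"
    if 5 * same_label_neighbors >= 2 * k:
        return "borderline"
    if 5 * same_label_neighbors >= k:
        return "rare"
    return "outlier"


# Toy data: six majority points on a line, one minority point placed among them.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (2.5, 0.2)]
labels = ["maj"] * 6 + ["min"]

counts = neighbor_agreement(points, labels, k=5)
types = [difficulty_type(c, k=5) for c in counts]
for label, count, kind in zip(labels, counts, types):
    print(label, count, kind)
```

In a resampling step, categories like these could decide which instances are oversampled or removed, for example by generating synthetic points preferentially around borderline minority instances rather than around outliers.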
Pages: 15