An insight into imbalanced Big Data classification: outcomes and challenges

Cited by: 135
Authors
Fernandez, Alberto [1]
del Rio, Sara [1]
Chawla, Nitesh V. [2, 3]
Herrera, Francisco [1]
Affiliations
[1] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain
[2] Univ Notre Dame, Dept Comp Sci & Engn, 384 Fitzpatrick Hall, Notre Dame, IN 46556 USA
[3] Univ Notre Dame, Interdisciplinary Ctr Network Sci & Applicat, 384 Nieuwland Hall Sci, Notre Dame, IN 46556 USA
Funding
US National Science Foundation (NSF)
Keywords
Big Data; Imbalanced classification; MapReduce; Pre-processing; Sampling; MAPREDUCE; PERFORMANCE; COMBINATION; SYSTEMS; SMOTE;
DOI
10.1007/s40747-017-0037-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Big Data applications have emerged in recent years, and researchers from many disciplines are aware of the significant advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be applied directly because of scalability issues. To overcome this, the MapReduce framework has arisen as a "de facto" solution. Basically, it carries out a "divide-and-conquer" distributed procedure in a fault-tolerant way suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The main reason is the difficulty of adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data are partitioned to fit the MapReduce programming style. This paper is built on three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current state of research in this area. Second, to analyze the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, we discuss the challenges and future directions for the topic.
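The partitioning effect described in the abstract can be illustrated with a small sketch. It is not the authors' implementation and does not use a real MapReduce engine: plain Python lists stand in for distributed partitions, and random oversampling stands in for the pre-processing techniques the paper studies. The function names (partition, oversample_partition, map_reduce_oversample) and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def partition(records, num_partitions, seed=0):
    """Shuffle the records and split them into roughly equal chunks (the map inputs)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_partitions] for i in range(num_partitions)]


def oversample_partition(chunk, rng):
    """Map step (hypothetical): random oversampling inside a single chunk.

    Each class is topped up with randomly repeated rows until it matches the
    largest class in the chunk. A chunk that contains no minority rows cannot
    be balanced, which is the "lack of data" problem in miniature.
    """
    if not chunk:
        return []
    by_label = {}
    for features, label in chunk:
        by_label.setdefault(label, []).append((features, label))
    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced


def map_reduce_oversample(records, num_partitions, seed=0):
    """Apply the map step to every chunk, then concatenate the results (reduce step)."""
    rng = random.Random(seed)
    chunks = partition(records, num_partitions, seed)
    mapped = [oversample_partition(chunk, rng) for chunk in chunks]
    return [row for chunk in mapped for row in chunk]


if __name__ == "__main__":
    # Toy data set: 980 majority rows (label 0) and 20 minority rows (label 1).
    data = [((i,), 1 if i < 20 else 0) for i in range(1000)]
    for n in (1, 10, 50):
        per_chunk = [sum(1 for _, y in c if y == 1) for c in partition(data, n)]
        print(n, "partitions -> fewest minority rows in a chunk:", min(per_chunk))
    print(Counter(label for _, label in map_reduce_oversample(data, 10)))
```

With more partitions, each mapper sees fewer minority rows, so any per-partition sampling step has less information to work from, and some chunks may hold no minority rows at all.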
Pages: 105-120
Number of pages: 16
Related papers
50 in total
  • [21] An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics
    Lopez, Victoria
    Fernandez, Alberto
    Garcia, Salvador
    Palade, Vasile
    Herrera, Francisco
    INFORMATION SCIENCES, 2013, 250 : 113 - 141
  • [22] Improved multi-class classification approach for imbalanced big data on spark
    Singh, Tinku
    Khanna, Riya
    Satakshi
    Kumar, Manish
    JOURNAL OF SUPERCOMPUTING, 2023, 79 (06): : 6583 - 6611
  • [23] Fuzzy integral-based ELM ensemble for imbalanced big data classification
    Zhai, Junhai
    Zhang, Sufang
    Zhang, Mingyang
    Liu, Xiaomeng
    SOFT COMPUTING, 2018, 22 (11) : 3519 - 3531
  • [24] SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification
    Gutiérrez P.D.
    Lastra M.
    Benítez J.M.
    Herrera F.
    Progress in Artificial Intelligence, 2017, 6 (4) : 347 - 354
  • [25] A Classification Method of Imbalanced Big Data Based on Improved SMOTE and Stacked LSTM
    Xu, Wentao
    Journal of Network Intelligence, 2023, 8 (01): : 100 - 112
  • [26] Improved multi-class classification approach for imbalanced big data on spark
    Tinku Singh
    Riya Khanna
    Manish Satakshi
    The Journal of Supercomputing, 2023, 79 : 6583 - 6611
  • [27] Enriched Over-Sampling Techniques for Improving Classification of Imbalanced Big Data
    Patil, Sachin Subhash
    Sonavane, Shefali Pratap
    2017 THIRD IEEE INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (IEEE BIGDATASERVICE 2017), 2017, : 1 - 10
  • [28] Fuzzy integral-based ELM ensemble for imbalanced big data classification
    Junhai Zhai
    Sufang Zhang
    Mingyang Zhang
    Xiaomeng Liu
    Soft Computing, 2018, 22 : 3519 - 3531
  • [29] Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark
    Triguero, I.
    Galar, M.
    Merino, D.
    Maillo, J.
    Bustince, H.
    Herrera, F.
    2016 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2016, : 640 - 647
  • [30] Improved cost-sensitive representation of data for solving the imbalanced big data classification problem
    Fattahi, Mahboubeh
    Moattar, Mohammad Hossein
    Forghani, Yahya
    JOURNAL OF BIG DATA, 2022, 9 (01)