Spark-based ensemble learning for imbalanced data classification

Cited by: 0
Authors
Ding J. [1]
Wang S. [1]
Jia L. [1]
You J. [1]
Jiang Y. [1]
Affiliations
[1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming
Funding
National Natural Science Foundation of China
Keywords
Comprehensive weight; Ensemble learning; Imbalanced data classification; Random forest; Spark;
DOI
10.23940/ijpe.18.05.p14.955964
Abstract
With the rapid expansion of Big Data across science and engineering domains, imbalanced data classification has become a more acute problem in many real-world datasets. It is notably difficult to build an efficient model by mechanically applying current data mining and machine learning algorithms. In this paper, we propose a Spark-based Ensemble Learning approach for imbalanced data classification (SELidc for short). SELidc rests on two ideas: preprocessing to balance the imbalanced dataset, and building a distributed ensemble learning algorithm to improve performance and reduce overfitting on big, imbalanced data. SELidc first converts the original imbalanced dataset into resilient distributed datasets (RDDs). Next, in the sampling process, it samples by a comprehensive weight, which is derived from the weight of each class within the majority class and the number of minority-class samples. It then trains several random forest classifiers in the Spark environment by means of correlation-based feature selection. Experiments on publicly available UCI datasets and other datasets show that SELidc achieves better results than related approaches across various evaluation metrics and makes full use of the computing power of the Spark distributed platform when training on massive data. © 2018 Totem Publisher, Inc. All rights reserved.
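The sampling step described in the abstract can be illustrated with a small sketch. The abstract does not give the formula for the comprehensive weight, so the code below assumes one plausible reading: each majority class receives a share of a sampling budget equal to the minority-class count, in proportion to its size. It runs in plain Python rather than on Spark RDDs, and the names `allocate_quota` and `balance` are hypothetical, not taken from the paper.

```python
import random
from collections import defaultdict


def allocate_quota(majority_counts, n_minority):
    """Split a budget of n_minority samples across the majority classes.

    ASSUMPTION: the weight of a class is its share of all majority samples.
    The paper's actual "comprehensive weight" may be defined differently.
    """
    total = sum(majority_counts.values())
    # Floor of each class's proportional share of the budget.
    quota = {c: (n * n_minority) // total for c, n in majority_counts.items()}
    # Hand out what is left to the classes with the largest remainders.
    leftover = n_minority - sum(quota.values())
    by_remainder = sorted(
        majority_counts,
        key=lambda c: (majority_counts[c] * n_minority) % total,
        reverse=True,
    )
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota


def balance(samples, minority_label, seed=0):
    """Keep every minority sample and undersample each majority class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append((features, label))
    minority = by_class.pop(minority_label)
    counts = {c: len(rows) for c, rows in by_class.items()}
    quota = allocate_quota(counts, len(minority))
    balanced = list(minority)
    for c, rows in by_class.items():
        # Never ask for more rows than the class actually has.
        balanced.extend(rng.sample(rows, min(quota[c], len(rows))))
    return balanced


# Toy data: two majority classes ("a", "b") and one minority class ("rare").
data = (
    [((i,), "a") for i in range(60)]
    + [((i,), "b") for i in range(30)]
    + [((i,), "rare") for i in range(10)]
)
balanced = balance(data, "rare")
```

In a Spark setting the same per-class quotas could be turned into sampling fractions and applied to a keyed RDD, after which each balanced subset would feed one random forest in the ensemble; that wiring is left out here because the abstract does not specify it.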
Pages: 945 - 964
Page count: 19
Related papers
50 in total
  • [11] Meta-learning for imbalanced data and classification ensemble in binary classification
    Lin, Sung-Chiang
    Chang, Yuan-chin I.
    Yang, Wei-Ning
    NEUROCOMPUTING, 2009, 73 (1-3) : 484 - 494
  • [12] A Method of Imbalanced Traffic Classification Based on Ensemble Learning
    Ding, Yaojun
    2015 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (ICSPCC), 2015 : 265 - 268
  • [13] imDC: an ensemble learning method for imbalanced classification with miRNA data
    Wang, C. Y.
    Hu, L. L.
    Guo, M. Z.
    Liu, X. Y.
    Zou, Q.
    GENETICS AND MOLECULAR RESEARCH, 2015, 14 (01) : 123 - 133
  • [14] A Selective Ensemble Learning Framework for ECG-Based Heartbeat Classification with Imbalanced Data
    Ge, Hongwei
    Sun, Keyi
    Sun, Liang
    Zhao, Mingde
    Wu, Chunguo
    PROCEEDINGS 2018 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2018 : 2753 - 2755
  • [15] Ensemble weighted extreme learning machine for imbalanced data classification based on differential evolution
    Zhang, Yong
    Liu, Bo
    Cai, Jing
    Zhang, Suhua
    NEURAL COMPUTING & APPLICATIONS, 2017, 28 : S259 - S267
  • [16] Ensemble weighted extreme learning machine for imbalanced data classification based on differential evolution
    Yong Zhang
    Bo Liu
    Jing Cai
    Suhua Zhang
    Neural Computing and Applications, 2017, 28 : 259 - 267
  • [17] A Spark-Based Artificial Bee Colony Algorithm for Unbalanced Large Data Classification
    Al-Sawwa, Jamil
    Almseidin, Mohammad
    INFORMATION, 2022, 13 (11)
  • [18] Ensemble Approach for the Classification of Imbalanced Data
    Nikulin, Vladimir
    McLachlan, Geoffrey J.
    Ng, Shu Kay
    AI 2009: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2009, 5866 : 291 - +
  • [19] KDE-Based Ensemble Learning for Imbalanced Data
    Kamalov, Firuz
    Moussa, Sherif
    Reyes, Jorge Avante
    ELECTRONICS, 2022, 11 (17)
  • [20] Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification
    Oh, Sangyoon
    Lee, Min Su
    Zhang, Byoung-Tak
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2011, 8 (02) : 316 - 325