Spark-based ensemble learning for imbalanced data classification

Authors
Ding J. [1]
Wang S. [1]
Jia L. [1]
You J. [1]
Jiang Y. [1]
Affiliation
[1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming
Funding
National Natural Science Foundation of China
Keywords
Comprehensive weight; Ensemble learning; Imbalanced data classification; Random forest; Spark;
DOI
10.23940/ijpe.18.05.p14.955964
Abstract
With the rapid expansion of Big Data across science and engineering domains, imbalanced data classification has become a more acute problem in many real-world datasets. It is notably difficult to develop an efficient model by mechanically applying current data mining and machine learning algorithms. In this paper, we propose a Spark-based Ensemble Learning approach for imbalanced data classification (SELidc for short). The key points of SELidc are preprocessing to balance the imbalanced dataset, and building a distributed ensemble learning algorithm to improve performance and reduce overfitting on big, imbalanced data. SELidc first converts the original imbalanced dataset into resilient distributed datasets (RDDs). Next, in the sampling step, it samples by a comprehensive weight, which is derived from the weight of each class within the majority class and the number of minority-class samples. It then trains several random forest classifiers in the Spark environment, using correlation feature selection. Experiments on publicly available UCI datasets and other datasets demonstrate that SELidc achieves better results than other related approaches across various evaluation metrics, and that it makes full use of the efficient computing power of the Spark distributed platform when training on massive data. © 2018 Totem Publisher, Inc. All rights reserved.
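The sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for the comprehensive weight, so the code below assumes a simple stand-in: each sub-class of the majority class is weighted by its share of the majority samples, and the sampling quotas are scaled so that the total is roughly the number of minority samples. The function name `sample_majority_by_weight` and the `(label, features)` record layout are hypothetical, and the sketch runs in plain Python rather than on Spark RDDs.

```python
import random
from collections import defaultdict


def sample_majority_by_weight(majority, n_minority, seed=0):
    """Undersample the majority class to roughly n_minority records.

    majority   -- list of (subclass_label, features) pairs (hypothetical layout)
    n_minority -- number of minority-class samples to balance against

    ASSUMPTION: the weight of a sub-class is its share of the majority
    samples. This is a stand-in, not the paper's comprehensive weight.
    """
    rng = random.Random(seed)

    # Group the majority records by their sub-class label.
    groups = defaultdict(list)
    for label, features in majority:
        groups[label].append((label, features))

    total = len(majority)
    sampled = []
    for label in sorted(groups):
        members = groups[label]
        # Quota proportional to the sub-class share, at least 1 and
        # never more than the sub-class actually contains.
        quota = round(len(members) * n_minority / total)
        quota = min(len(members), max(1, quota))
        sampled.extend(rng.sample(members, quota))
    return sampled
```

With 60, 30, and 10 majority records in three sub-classes and 10 minority records, this sketch draws 6, 3, and 1 records respectively, so the reduced majority set matches the minority set in size while preserving the sub-class proportions.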
Pages: 945 / 964
Page count: 19