Distributed Ensemble Feature Selection Framework for High-Dimensional and High-Skewed Imbalanced Big Dataset

Cited by: 1
Authors
Soheili, Majid [1]
Haeri, Maryam Amir Amir [2]
Affiliations
[1] Islamic Azad Univ, Comp Engn Dept, Neka Branch, Neka, Iran
[2] Univ Twente, Learning Data Analyt & Technol Dept, Enschede, Netherlands
Keywords
Scalable Feature Selection; Distributed Ensemble Learning; Imbalanced Big Data Set
DOI
10.1109/SSCI50451.2021.9659937
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The class-imbalance problem emerges when the class labels of a dataset have a skewed distribution. In this situation, the instances of one class, which is typically the class of primary interest, are heavily outnumbered by the instances of the other classes. In recent years, feature selection for high-dimensional imbalanced data has become an attractive research area. This technique selects an informative feature set to improve the accuracy of the classification model. Moreover, as a subcategory of feature selection, feature ranking has been studied over the last decade as a way to cope with high-dimensional datasets. On the one hand, most traditional feature selection methods are not scalable, although scalability is critical for coping with large-scale datasets. On the other hand, scalability is an intrinsic characteristic of the ensemble learning approach. This paper proposes a Distributed Ensemble Imbalanced feature selection framework, called DEIM, to deal with big imbalanced datasets. First, DEIM transforms the default data partitions into representative partitions in a single pass. Second, it applies a feature ranking method in a bagging approach to each partition independently. Finally, it fuses the intermediate feature rankings using a stacking strategy. In this paper, two traditional feature ranking algorithms, ReliefF and QPFS, are plugged into DEIM, producing two methods, DEIM-Relief and DEIM-QPFS. Experiments are conducted on three big imbalanced datasets on a computer cluster. The empirical study shows that the produced methods are scalable. They also have lower execution times, and their final results can induce better classification models than DiReliefF and DQPFS.
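The three stages named in the abstract (representative partitions, bagged ranking per partition, fusion of the intermediate rankings) can be illustrated with a small single-machine sketch. This is not the authors' implementation: the record gives no algorithmic details, so every function name here is hypothetical, a simple standardized class-mean gap stands in for ReliefF or QPFS, class-stratified splitting stands in for the "representative partitions" step, and mean-rank aggregation stands in for the stacking-based fusion. The data is synthetic.

```python
import numpy as np

def stratified_partitions(y, n_parts, rng):
    """Split row indices so each partition keeps roughly the global class ratio.
    Assumption: this is only one plausible reading of 'representative partitions'."""
    parts = [[] for _ in range(n_parts)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for k, chunk in enumerate(np.array_split(idx, n_parts)):
            parts[k].extend(chunk.tolist())
    return [np.array(p) for p in parts]

def mean_gap_score(X, y):
    """Stand-in scorer (NOT ReliefF or QPFS): standardized gap between class means."""
    gap = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return gap / (X.std(axis=0) + 1e-12)

def bagged_ranks(X, y, n_bags, rng):
    """Average each feature's rank position over bootstrap resamples of one partition.
    Resampling is done per class so the minority class never vanishes from a bag."""
    ranks = np.zeros(X.shape[1])
    for _ in range(n_bags):
        boot = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=int((y == c).sum()))
            for c in (0, 1)
        ])
        scores = mean_gap_score(X[boot], y[boot])
        ranks += np.argsort(np.argsort(-scores))  # rank 0 = highest score
    return ranks / n_bags

def deim_like_ranking(X, y, n_parts=4, n_bags=5, seed=0):
    """Fuse per-partition rankings by mean rank (a stand-in for the stacking step)."""
    rng = np.random.default_rng(seed)
    per_part = [bagged_ranks(X[p], y[p], n_bags, rng)
                for p in stratified_partitions(y, n_parts, rng)]
    return np.argsort(np.mean(per_part, axis=0))  # feature indices, best first

# Synthetic imbalanced data: 5% minority class, features 0 and 1 are informative.
demo_rng = np.random.default_rng(42)
y = np.zeros(2000, dtype=int)
y[:100] = 1
X = demo_rng.normal(size=(2000, 20))
X[y == 1, :2] += 3.0

ranking = deim_like_ranking(X, y)
print(ranking[:5])
```

In a real distributed setting each partition would be ranked on a separate worker and only the small per-partition rank vectors would be sent back for fusion, which is what makes this kind of ensemble design scale with the number of rows.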
Pages: 8