A Novel Hybrid Sampling Algorithm for Solving Class Imbalance Problem in Big Data

Cited by: 1
Authors
Ahlawat, Khyati [1]
Chug, Anuradha [2]
Singh, Amit Prakash [2]
Affiliations
[1] Indira Gandhi Delhi Tech Univ Women Kashmere Gate, Delhi 110006, India
[2] Guru Gobind Singh Indraprastha Univ, Univ Sch Informat Commun & Technol, Sector 16C, Delhi 110078, India
Keywords
Imbalance data; clustering; big data processing; biasness; sampling; CLASSIFICATION; MAPREDUCE; PREDICTION; SYSTEMS; FRAMEWORK; INSIGHT; HADOOP;
DOI
10.1142/S2424922X21500054
Chinese Library Classification
O1 [Mathematics];
Discipline classification code
0701; 070101
Abstract
An uneven distribution of classes in a dataset biases standard classifiers toward the majority class. Instances of the minority class, which is usually the class of interest, are few in number and tend to be ignored; their correct classification, although of paramount interest, is masked when only overall accuracy is reported. Conventional machine learning approaches have therefore been refined to address this class imbalance problem. The challenge is more prevalent in big data scenarios because of the high data volume. This study presents a sampling solution based on cluster computing for handling class imbalance in big data. The proposed hybrid sampling algorithm (HSA) is assessed with three popular classification algorithms, namely support vector machine, decision tree, and k-nearest neighbor, using balanced accuracy and elapsed time as evaluation measures. The experimental results are promising, with an efficiency gain of 42% over the traditional sampling solution, the synthetic minority oversampling technique (SMOTE). This work demonstrates the effectiveness of the distribution and clustering principle in imbalanced big data scenarios.
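The abstract's point that overall accuracy can hide poor minority-class performance, and its choice of balanced accuracy as a measure, can be illustrated with a small sketch. This is not the authors' code or data: the 95/5 class split, the always-predict-majority classifier, and the helper names `accuracy` and `balanced_accuracy` are assumptions made purely for illustration. Balanced accuracy is taken here in its common sense, the mean of per-class recall.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of all instances predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy data (illustrative only): 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}, balanced accuracy = {bal_acc:.2f}")
```

On this toy data the script prints `accuracy = 0.95, balanced accuracy = 0.50`: the classifier looks strong on overall accuracy even though it never identifies a single minority instance, which is the bias the abstract describes.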
Pages: 18