Enhanced SMOTE Algorithm for Classification of Imbalanced Big-Data using Random Forest

Cited by: 0
Authors
Bhagat, Reshma C. [1]
Patil, Sachin S. [1]
Affiliations
[1] Rajarambapu Inst Technol, Dept CSE, Islampur Sangli, MS, India
Keywords
Data mining; Multi-class imbalanced data; Oversampling; MapReduce; Machine learning
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
In the era of big data, applications that generate tremendous amounts of data have become a main focus of attention, owing to the sharp increase in data generation and storage over the last few years. This scenario is challenging for data mining techniques that are not adapted to the new space and time requirements. In many real-world applications, the classification of imbalanced datasets is of particular interest. Most classification methods focus on the two-class imbalanced problem, so it is necessary to address the multi-class imbalanced problem, which also exists in real-world domains. In this work, we introduce a methodology for the classification of multi-class imbalanced data. The methodology consists of two steps. In the first step, binarization techniques (One-vs-All, OVA, and One-vs-One, OVO) decompose the original dataset into subsets of binary classes. In the second step, the SMOTE algorithm is applied to each imbalanced binary subset to obtain balanced data. Finally, a Random Forest (RF) classifier is used to achieve the classification goal. Specifically, the oversampling technique is adapted to big data using MapReduce so that it can handle datasets as large as needed. An experimental study is carried out to evaluate the performance of the proposed method. For the experimental analysis, we used different datasets from the UCI repository, and the proposed system is implemented on the Apache Hadoop and Apache Spark platforms. The results obtained show that the proposed method outperforms the other methods.
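The two steps described in the abstract (binarization, then SMOTE on each binary subset) can be sketched roughly as follows. This is a minimal single-machine illustration in plain Python, not the authors' implementation: the function names `ovo_subsets`, `ova_subsets`, `smote`, and `balance_binary` are hypothetical, samples are assumed to be lists of numeric features, and the MapReduce distribution and the Random Forest training step are left out.

```python
import random
from collections import Counter
from itertools import combinations


def ovo_subsets(X, y):
    """One-vs-One: yield one binary subset per pair of classes."""
    for a, b in combinations(sorted(set(y)), 2):
        rows = [(x, label) for x, label in zip(X, y) if label in (a, b)]
        yield (a, b), [r[0] for r in rows], [r[1] for r in rows]


def ova_subsets(X, y):
    """One-vs-All: yield one binary subset per class (class vs. 'rest')."""
    for c in sorted(set(y)):
        yield c, list(X), [label if label == c else "rest" for label in y]


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        return []  # assumption: no interpolation possible with one sample
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda p: sum((u - v) ** 2 for u, v in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([u + gap * (v - u) for u, v in zip(base, neighbour)])
    return synthetic


def balance_binary(X, y, k=5, rng=None):
    """Oversample the smaller class of a two-class subset up to the larger."""
    (_, n_major), (minor, n_minor) = Counter(y).most_common(2)
    minority = [x for x, label in zip(X, y) if label == minor]
    new_points = smote(minority, n_major - n_minor, k, rng)
    return X + new_points, y + [minor] * len(new_points)


# Toy data: class "a" dominates, "b" and "c" are minority classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8],
     [5, 5], [5, 6], [9, 9], [9, 8], [8, 9]]
y = ["a"] * 6 + ["b"] * 2 + ["c"] * 3

balanced = {}
for pair, X_sub, y_sub in ovo_subsets(X, y):
    X_bal, y_bal = balance_binary(X_sub, y_sub, rng=random.Random(1))
    balanced[pair] = (X_bal, y_bal)
    print(pair, dict(Counter(y_bal)))
```

Each balanced subset would then be passed to a binary classifier (a Random Forest in the paper), and the per-subset predictions combined by voting. In a MapReduce setting, one plausible arrangement is to run the oversampling per data partition in the map phase, though the abstract does not specify the details.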
Pages: 403-408
Number of pages: 6