Imbalanced Big Data Classification: A Distributed Implementation of SMOTE

Cited by: 16
Authors
Rastogi, Avnish Kumar [1]
Narang, Nitin [1]
Siddiqui, Zamir Ahmad [1]
Affiliations
[1] HCL Technol, Noida, Uttar Pradesh, India
Keywords
SMOTE; Imbalanced Classification; Locality Sensitivity Hashing; Nearest Neighbors; Spark; Map Reduce
DOI
10.1145/3170521.3170535
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
In machine learning, data quality is the most critical component for building good models. Predictive analytics is an AI stream that predicts future events from historical learnings and is used in diverse fields such as predicting online fraud, oil slicks, intrusion attacks, credit defaults, and the prognosis of disease cells. Unfortunately, in most of these cases, traditional learning models fail to generate the required results because the data is imbalanced. Here, imbalance denotes a small number of instances belonging to the class under prediction, such as fraud instances among all online transactions. Prediction in imbalanced classification is further limited by factors such as small disjuncts, which are accentuated when data is partitioned for learning at scale. Synthetic generation of minority class data (SMOTE [1]) is a pioneering approach by Chawla [1] to offset these limitations and generate more balanced datasets. Although a standard implementation of SMOTE exists in Python, it is unavailable for distributed computing environments that handle large datasets. Bringing SMOTE to a distributed environment under Spark is the key motivation for our research. In this paper we present our algorithm, observations, and results for synthetic generation of minority class data under Spark using Locality Sensitivity Hashing (LSH). We successfully demonstrated a distributed version of SMOTE on Spark that generated quality artificial samples preserving spatial distribution.
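The abstract does not give the algorithm itself, so the following is only a minimal single-machine sketch of the general idea it describes: bucket minority points with a locality-sensitive hash, search for nearest neighbors only inside each bucket, and create synthetic points by SMOTE-style interpolation. It is not the authors' implementation. The function names (`lsh_buckets`, `smote_with_lsh`), the single random projection, and the parameters `bucket_width` and `k` are assumptions made for illustration; a Spark version would distribute the bucketing and neighbor search rather than loop in one process.

```python
import math
import random
from collections import defaultdict


def lsh_buckets(points, bucket_width, rng):
    """Group point indices by one bucketed random projection (Euclidean LSH)."""
    dim = len(points[0])
    # Random Gaussian direction, normalised to unit length.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    direction = [d / norm for d in direction]
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        projection = sum(a * b for a, b in zip(point, direction))
        # Nearby points tend to share the same floor(projection / width).
        buckets[math.floor(projection / bucket_width)].append(idx)
    return buckets


def smote_with_lsh(minority, n_synthetic, k=5, bucket_width=1.0, seed=0):
    """Return n_synthetic points interpolated between bucket-local neighbours.

    Hypothetical helper: approximate neighbours come from a single hash
    table, so true neighbours that land in another bucket are missed.
    """
    rng = random.Random(seed)
    buckets = lsh_buckets(minority, bucket_width, rng)
    # Only points that share a bucket with another point have a neighbour.
    bucket_of = {}
    for members in buckets.values():
        if len(members) >= 2:
            for idx in members:
                bucket_of[idx] = members
    if not bucket_of:
        raise ValueError("no bucket holds two minority points; widen bucket_width")
    candidates = sorted(bucket_of)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(candidates)
        others = [j for j in bucket_of[base] if j != base]
        # Exact k-nearest-neighbour search, restricted to the bucket.
        others.sort(key=lambda j: math.dist(minority[base], minority[j]))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # SMOTE interpolation factor in [0, 1)
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(minority[base], minority[neighbour])]
        )
    return synthetic


if __name__ == "__main__":
    minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for sample in smote_with_lsh(minority_points, n_synthetic=3, bucket_width=10.0):
        print(sample)
```

Because every synthetic point lies on the segment between two existing minority points, the output stays inside the region the minority class already occupies. For a distributed setting, Spark MLlib provides `BucketedRandomProjectionLSH` in `pyspark.ml.feature`, which could take the place of the hand-rolled hashing shown here; whether the paper uses it is not stated in the abstract.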
Pages: 6