Improved KD-tree based imbalanced big data classification and oversampling for MapReduce platforms

Cited by: 0
Authors
Sleeman, William C. [1 ]
Roseberry, Martha [2 ]
Ghosh, Preetam [2 ]
Cano, Alberto [2 ]
Krawczyk, Bartosz [3 ]
Affiliations
[1] Virginia Commonwealth Univ, Dept Radiat Oncol, Richmond, VA 23284 USA
[2] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA USA
[3] Rochester Inst Technol, Ctr Imaging Sci, Rochester, NY USA
Keywords
Apache Spark; Amazon Web Services; Imbalanced data; k-dimensional trees; Machine learning; SMOTE; Nearest-neighbor classification; Spark; GPU
DOI
10.1007/s10489-024-05763-w
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the era of big data, it is necessary to provide novel and efficient platforms for training machine learning models over large volumes of data. The MapReduce approach and its Apache Spark implementation are among the most popular methods that provide high-performance computing for classification algorithms. However, they require dedicated implementations that take advantage of such architectures. Additionally, many real-world big data problems are plagued by class imbalance, which poses challenges to the classifier training step. Existing solutions for alleviating skewed distributions do not work well in the MapReduce environment. In this paper, we propose a novel KD-tree based classifier, together with a variation of the SMOTE algorithm dedicated to the Spark platform. Our algorithms offer excellent predictive power and can handle both binary and multi-class imbalanced data. Exhaustive experiments conducted on the Amazon Web Services platform showcase the high efficiency and flexibility of our proposed algorithms.
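For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal single-machine sketch of textbook SMOTE driven by a KD-tree nearest-neighbour search. It is not the authors' Spark implementation, and it does not reproduce their "improved" KD-tree or their SMOTE variation. The function names (`build_kdtree`, `knn`, `smote`) and the parameters `k`, `n_new`, and `seed` are hypothetical, chosen only for illustration. The sketch assumes the minority points are distinct tuples of floats.

```python
import heapq
import random


def build_kdtree(points, depth=0):
    """Recursively build a KD-tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes the node
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(tree, query, k, depth=0, heap=None):
    """Collect the k nearest points to `query` as a heap of (-sq_dist, point)."""
    if heap is None:
        heap = []
    if tree is None:
        return heap
    point, left, right = tree
    axis = depth % len(query)
    d = sq_dist(point, query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, point))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, point))
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn(near, query, k, depth + 1, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current worst neighbour (or if fewer than k neighbours were found).
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn(far, query, k, depth + 1, heap)
    return heap


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    tree = build_kdtree(minority)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Ask for k + 1 neighbours because the base point matches itself.
        neighbours = [p for _, p in knn(tree, base, k + 1) if p != base]
        other = rng.choice(neighbours)
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Illustrative minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
samples = smote(minority, n_new=5, k=2)
```

Each synthetic sample lies on the line segment between a minority point and one of its `k` nearest minority neighbours, which is the core idea of SMOTE. The KD-tree only speeds up the neighbour lookup. A distributed version would additionally need to partition the tree and the neighbour queries across workers, which this sketch does not attempt.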
Pages: 12558-12575
Page count: 18
Related papers
50 in total
  • [1] Mining Hidden Communities in Social Networks Using KD-Tree and Improved KD-Tree
    Devi, Renuga R.
    Hemalatha, M.
    [J]. 2013 FOURTH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATIONS AND NETWORKING TECHNOLOGIES (ICCCNT), 2013,
  • [2] Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification
    Vairetti, Carla
    Assadi, Jose Luis
    Maldonado, Sebastian
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 246
  • [3] Gaussian Distribution Based Oversampling for Imbalanced Data Classification
    Xie, Yuxi
    Qiu, Min
    Zhang, Haibo
    Peng, Lizhi
    Chen, Zhenxiang
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2022, 34 (02) : 667 - 679
  • [4] Why does SBVH outperform KD-tree on parallel platforms?
    Breglia, Alfonso
    Capozzoli, Amedeo
    Curcio, Claudio
    Liseno, Angelo
    [J]. 2016 IEEE/ACES INTERNATIONAL CONFERENCE ON WIRELESS INFORMATION TECHNOLOGY AND SYSTEMS (ICWITS) AND APPLIED COMPUTATIONAL ELECTROMAGNETICS (ACES), 2016,
  • [5] An Improved D2GAN-based oversampling algorithm for imbalanced data classification
    Zhao, Xiaoqiang
    Yao, Qinglei
    [J]. STATISTICAL ANALYSIS AND DATA MINING, 2023, 16 (06) : 569 - 582
  • [6] Optimised kd-tree indexing of multimedia data
    Reiss, JD
    Selbie, J
    Sandler, MB
    [J]. Digital Media: Processing Multimedia Interactive Services, 2003, : 47 - 52
  • [7] Adaptive Oversampling for Imbalanced Data Classification
    Ertekin, Seyda
    [J]. INFORMATION SCIENCES AND SYSTEMS 2013, 2013, 264 : 261 - 269
  • [8] Analysis of Data Preprocessing Increasing the Oversampling Ratio for Extremely Imbalanced Big Data Classification
    del Rio, Sara
    Benitez, Jose M.
    Herrera, Francisco
    [J]. 2015 IEEE TRUSTCOM/BIGDATASE/ISPA, VOL 2, 2015, : 180 - 185
  • [9] TREE POINT CLOUDS REGISTRATION USING AN IMPROVED ICP ALGORITHM BASED ON KD-TREE
    Li, Shihua
    Wang, Jingxian
    Liang, Zuqin
    Su, Lian
    [J]. 2016 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS), 2016, : 4545 - 4548
  • [10] Radial-Based Oversampling for Multiclass Imbalanced Data Classification
    Krawczyk, Bartosz
    Koziarski, Michal
    Wozniak, Michal
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (08) : 2818 - 2831