Improved KD-tree based imbalanced big data classification and oversampling for MapReduce platforms

Citations: 0
Authors
Sleeman, William C. [1]
Roseberry, Martha [2]
Ghosh, Preetam [2]
Cano, Alberto [2]
Krawczyk, Bartosz [3]
Affiliations
[1] Virginia Commonwealth Univ, Dept Radiat Oncol, Richmond, VA 23284 USA
[2] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA USA
[3] Rochester Inst Technol, Ctr Imaging Sci, Rochester, NY USA
Keywords
Apache Spark; Amazon Web Services; Imbalanced data; k-dimensional trees; Machine learning; SMOTE; Nearest-neighbor classification; Spark; GPU
DOI
10.1007/s10489-024-05763-w
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the era of big data, it is necessary to provide novel and efficient platforms for training machine learning models over large volumes of data. The MapReduce approach and its Apache Spark implementation are among the most popular methods that provide high-performance computing for classification algorithms. However, they require dedicated implementations that will take advantage of such architectures. Additionally, many real-world big data problems are plagued by class imbalance, posing challenges to the classifier training step. Existing solutions for alleviating skewed distributions do not work well in the MapReduce environment. In this paper, we propose a novel KD-tree based classifier, together with a variation of the SMOTE algorithm dedicated to the Spark platform. Our algorithms offer excellent predictive power and can work simultaneously with binary and multi-class imbalanced data. Exhaustive experiments conducted using the Amazon Web Service platform showcase the high efficiency and flexibility of our proposed algorithms.
Pages: 12558 - 12575
Number of pages: 18
Related papers
50 in total
  • [41] Reversible data hiding in compressed and encrypted images by using Kd-tree
    Nasrullah, Nasrullah
    Sang, Jun
    Mateen, Muhammad
    Akbar, Muhammad Azeem
    Xiang, Hong
    Xia, Xiaofeng
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (13) : 17535 - 17554
  • [42] Kd-tree based fast ray tracing for RCS prediction
    Tao, Y. B.
    Lin, H.
    Bao, H. J.
    [J]. PROGRESS IN ELECTROMAGNETICS RESEARCH-PIER, 2008, 81 : 329 - 341
  • [43] On the use of MapReduce for imbalanced big data using Random Forest
    del Rio, Sara
    Lopez, Victoria
    Manuel Benitez, Jose
    Herrera, Francisco
    [J]. INFORMATION SCIENCES, 2014, 285 : 112 - 137
  • [44] Imbalanced Learning with Oversampling based on Classification Contribution Degree
    Jiang, Zhenhao
    Yang, Jie
    Liu, Yan
    [J]. ADVANCED THEORY AND SIMULATIONS, 2021, 4 (05)
  • [45] An oversampling framework for imbalanced classification based on Laplacian eigenmaps
    Ye, Xiucai
    Li, Hongmin
    Imakura, Akira
    Sakurai, Tetsuya
    [J]. NEUROCOMPUTING, 2020, 399 : 107 - 116
  • [46] Counterfactual-based minority oversampling for imbalanced classification
    Wang, Shu
    Luo, Hao
    Huang, Shanshan
    Li, Qingsong
    Liu, Li
    Su, Guoxin
    Liu, Ming
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 122
  • [47] Kd-tree and adaptive radius (KD-AR Stream) based real-time data stream clustering
    Senol, Ali
    Karacan, Hacer
    [J]. JOURNAL OF THE FACULTY OF ENGINEERING AND ARCHITECTURE OF GAZI UNIVERSITY, 2020, 35 (01) : 337 - 354
  • [48] Data Parallelization of Kd-tree Ray Tracing on the Cell Broadband Engine
    Pang, Yi
    Sun, Lifeng
    Yang, Shiqiang
    [J]. ICME: 2009 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-3, 2009 : 1246 - 1249
  • [49] Reversible data hiding in compressed and encrypted images by using Kd-tree
    Nasrullah Nasrullah
    Jun Sang
    Muhammad Mateen
    Muhammad Azeem Akbar
    Hong Xiang
    Xiaofeng Xia
    [J]. Multimedia Tools and Applications, 2019, 78 : 17535 - 17554
  • [50] An improved and random synthetic minority oversampling technique for imbalanced data
    Wei, Guoliang
    Mu, Weimeng
    Song, Yan
    Dou, Jun
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 248