Severely imbalanced Big Data challenges: investigating data sampling approaches

Cited by: 63
Authors
Hasanin, Tawfiq [1]
Khoshgoftaar, Taghi M. [1]
Leevy, Joffrey L. [1]
Bauder, Richard A. [1]
Affiliation
[1] Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431 USA
Funding
National Science Foundation (USA)
Keywords
Big Data; Class imbalance; Machine Learning; Medicare fraud; Oversampling; SlowlorisBig; Undersampling; CLASSIFICATION; MAPREDUCE; SMOTE;
DOI
10.1186/s40537-019-0274-4
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority class and false negatives incur a greater penalty than false positives, this bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, uses training data from one source (the SlowlorisBig dataset) and test data from a separate source (the POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach as measured by the Area Under the Receiver Operating Characteristic Curve (AUC) and Geometric Mean performance metrics, although Random Undersampling performs adequately. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique (SMOTE), SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) on both metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice, because it trains models on significantly fewer samples and thus reduces computational burden and training time.
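To make two ideas from the abstract concrete, the following is a minimal sketch in plain Python, not the authors' Apache Spark implementation. It assumes binary labels with 1 as the positive (minority) class and 0 as the negative (majority) class, and it assumes the common definition of Geometric Mean as the square root of the true positive rate times the true negative rate. The function names and the `neg_per_pos` parameter are hypothetical and chosen only for illustration.

```python
import math
import random
from typing import List, Sequence


def random_undersample(labels: Sequence[int], neg_per_pos: float, seed: int = 0) -> List[int]:
    """Return sorted row indices keeping every positive and a random subset of negatives.

    neg_per_pos is the desired number of negatives per positive (hypothetical
    parameter): 1.0 gives a 50:50 sample, 9.0 gives a 90:10 sample.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than are available.
    n_keep = min(len(neg), round(len(pos) * neg_per_pos))
    return sorted(pos + rng.sample(neg, n_keep))


def geometric_mean(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Assumed definition: sqrt(true positive rate * true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Toy data: 10 positives among 1,000 rows (1% minority class).
labels = [1] * 10 + [0] * 990
kept = random_undersample(labels, neg_per_pos=1.0, seed=42)
print(len(kept))  # 20 rows: all 10 positives plus 10 sampled negatives

# A model that always predicts the majority class has a TPR of 0.
always_negative = [0] * len(labels)
print(geometric_mean(labels, always_negative))  # 0.0
```

The toy example also shows why a metric such as Geometric Mean is useful under severe imbalance: the always-negative predictor is 99% accurate on this data, yet its Geometric Mean is 0 because it never detects a positive instance.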
Pages: 25