Severely imbalanced Big Data challenges: investigating data sampling approaches

Cited by: 63

Authors:

Hasanin, Tawfiq [1]

Khoshgoftaar, Taghi M. [1]

Leevy, Joffrey L. [1]

Bauder, Richard A. [1]

Affiliations:

[1] Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431 USA

Source:

JOURNAL OF BIG DATA | 2019 / Volume 6 / Issue 01

Funding:

National Science Foundation (USA);

Keywords:

Big Data; Class imbalance; Machine Learning; Medicare fraud; Oversampling; SlowlorisBig; Undersampling; CLASSIFICATION; MAPREDUCE; SMOTE;

DOI:

10.1186/s40537-019-0274-4

Chinese Library Classification number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
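
As an illustration of two ideas named in the abstract, the following is a minimal sketch, not the authors' Apache Spark implementation. It assumes binary labels (1 for the minority/positive class, 0 for the majority/negative class) and assumes the Geometric Mean metric is the commonly used square root of true positive rate times true negative rate. The function names `random_undersample` and `geometric_mean`, and the `minority_fraction` parameter, are hypothetical and chosen only for this example.

```python
import math
import random


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted row indices keeping all positives and a random subset of negatives.

    labels: sequence of 0 (majority) and 1 (minority) values.
    minority_fraction: desired share of positives after sampling (0.5 means 50:50).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed so positives make up `minority_fraction` of the result.
    n_neg = round(len(pos) * (1 - minority_fraction) / minority_fraction)
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    return sorted(pos + rng.sample(neg, n_neg))


def geometric_mean(y_true, y_pred):
    """Assumed definition: sqrt(true positive rate * true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)
```

Under this definition, a model that predicts the majority class for every instance scores a Geometric Mean of 0 despite high accuracy, which is one reason such a metric suits severely imbalanced data.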

Pages: 25
