Severely imbalanced Big Data challenges: investigating data sampling approaches

Cited by: 63

Authors:

Hasanin, Tawfiq [1]

Khoshgoftaar, Taghi M. [1]

Leevy, Joffrey L. [1]

Bauder, Richard A. [1]

Affiliation:

[1] Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431 USA

Source:

JOURNAL OF BIG DATA | 2019 / Vol. 6 / Issue 1

Funding:

U.S. National Science Foundation;

Keywords:

Big Data; Class imbalance; Machine Learning; Medicare fraud; Oversampling; SlowlorisBig; Undersampling; CLASSIFICATION; MAPREDUCE; SMOTE;

DOI:

10.1186/s40537-019-0274-4

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
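As a rough illustration of two ideas named in the abstract, the sketch below shows Random Undersampling to a target class ratio and the Geometric Mean metric, taken here as the square root of true positive rate times true negative rate. It is plain Python and not the authors' Apache Spark implementation; the function names `random_undersample` and `geometric_mean`, the `minority_fraction` parameter, and the toy data are all assumptions made for this example.

```python
import math
import random


def random_undersample(labels, minority_fraction, seed=0):
    """Keep every positive (minority) index and a random subset of negatives
    so that positives make up roughly `minority_fraction` of the result.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Negatives needed so that pos / (pos + neg) equals minority_fraction.
    n_neg = round(len(pos) * (1 - minority_fraction) / minority_fraction)
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    return sorted(pos + rng.sample(neg, n_neg))


def geometric_mean(y_true, y_pred):
    """Square root of (true positive rate * true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Toy data: 10 positives among 1,000 instances (1% minority class).
labels = [1] * 10 + [0] * 990
balanced = random_undersample(labels, minority_fraction=0.5)
print(len(balanced))  # 10 positives + 10 sampled negatives
```

Because undersampling discards majority instances instead of synthesizing minority ones, the training set shrinks, which is the source of the reduced computational burden the abstract mentions.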


Pages: 25

