Severely imbalanced Big Data challenges: investigating data sampling approaches

Cited by: 63
Authors
Hasanin, Tawfiq [1]
Khoshgoftaar, Taghi M. [1]
Leevy, Joffrey L. [1]
Bauder, Richard A. [1]
Affiliations
[1] Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431 USA
Funding
National Science Foundation (United States);
Keywords
Big Data; Class imbalance; Machine Learning; Medicare fraud; Oversampling; SlowlorisBig; Undersampling; CLASSIFICATION; MAPREDUCE; SMOTE;
DOI
10.1186/s40537-019-0274-4
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
Pages: 25
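
As a rough illustration of two ideas named in the abstract, the sketch below shows Random Undersampling to a chosen class distribution and the Geometric Mean metric. It is a minimal standard-library Python sketch and not the authors' Apache Spark implementation. The function names, the label encoding (1 = minority/positive, 0 = majority/negative), and the ratio used in the usage note are assumptions made for illustration; the abstract does not list the five ratios the paper evaluates.

```python
import math
import random


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted row indices kept after random undersampling.

    All minority (1) rows are kept; majority (0) rows are randomly discarded
    until the minority class makes up roughly `minority_fraction` of the
    sample. Hypothetical helper, not taken from the paper.
    """
    if not 0.0 < minority_fraction < 1.0:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve pos / (pos + n_neg) = minority_fraction for n_neg.
    n_neg = round(len(pos) * (1.0 - minority_fraction) / minority_fraction)
    n_neg = min(n_neg, len(neg))  # cannot keep more majority rows than exist
    return sorted(pos + rng.sample(neg, n_neg))


def geometric_mean(y_true, y_pred):
    """Geometric Mean of true positive rate and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)
```

For example, with 10 positive and 990 negative labels, `random_undersample(labels, 0.5)` keeps all 10 positives plus 10 randomly chosen negatives. A classifier that always predicts the majority class has a true positive rate of zero, so its Geometric Mean is zero, which is one reason the metric is informative under severe imbalance.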