Investigating class rarity in big data

Cited by: 0
Authors
Tawfiq Hasanin
Taghi M. Khoshgoftaar
Joffrey L. Leevy
Richard A. Bauder
Affiliation
[1] Florida Atlantic University
Keywords
Big data; Class imbalance; Machine learning; Medicare fraud; POSTSlowloris; Class rarity; Undersampling
Abstract
In Machine Learning, if one class has a significantly larger number of instances (majority) than the other (minority), this condition is defined as class imbalance. With regard to datasets, class imbalance can bias the predictive capabilities of Machine Learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this imbalance may lead to adverse consequences. Our paper incorporates two case studies, each utilizing a unique approach of three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets of gradually decreasing positive class instances. All model evaluations were performed through Cross-Validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high performance scores for the learners across all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, with regard to both case studies, the Gradient-Boosted Trees (GBT) learner performs the best.
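The rarity procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the eight positive-class counts, and the toy label vector are all hypothetical, each subset is assumed to be drawn independently from the full positive class while every negative instance is kept, and the Geometric Mean is assumed to follow its common definition as the square root of the true positive rate times the true negative rate.

```python
import math
import random


def make_rarity_subsets(labels, positive_counts, seed=0):
    """Build one index subset per requested positive-class count.

    Every negative (label 0) instance is kept; positives (label 1) are
    randomly removed until only `count` of them remain.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    subsets = {}
    for count in positive_counts:
        if count > len(pos):
            raise ValueError("requested more positives than are available")
        kept_pos = rng.sample(pos, count)  # sampling without replacement
        subsets[count] = sorted(neg + kept_pos)
    return subsets


def geometric_mean(y_true, y_pred):
    """Geometric mean of true positive rate and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Hypothetical toy data: 500 positives and 5000 negatives.
toy_labels = [1] * 500 + [0] * 5000
# Hypothetical counts; the abstract does not state the actual values.
toy_counts = [400, 200, 100, 50, 40, 30, 20, 10]
toy_subsets = make_rarity_subsets(toy_labels, toy_counts)

for count, indices in toy_subsets.items():
    n_pos = sum(toy_labels[i] for i in indices)
    print(count, n_pos, len(indices) - n_pos)
```

In a fuller experiment, each subset would be passed to cross-validated training of the three learners, with the ROC and precision-recall areas computed from predicted scores alongside the Geometric Mean.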