Investigating class rarity in big data

Authors
Tawfiq Hasanin
Taghi M. Khoshgoftaar
Joffrey L. Leevy
Richard A. Bauder
Affiliation
[1] Florida Atlantic University
Source
Journal of Big Data, 2020, 7 (01)
Keywords
Big data; Class imbalance; Machine learning; Medicare fraud; POSTSlowloris; Class rarity; Undersampling
Abstract
In Machine Learning, class imbalance is the condition in which one class (the majority) has significantly more instances than the other (the minority). Class imbalance in a dataset can bias the predictions of Machine Learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this bias may lead to adverse consequences. Our paper incorporates two case studies, each using three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets with gradually decreasing numbers of positive class instances. All model evaluations were performed through Cross-Validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high performance scores for the learners with all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, across both case studies, the Gradient-Boosted Trees (GBT) learner performs the best.
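The subset-generation step and the Geometric Mean metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy labels, and the positive-class counts are hypothetical, and the Geometric Mean is assumed to follow the common definition, the square root of the true positive rate times the true negative rate.

```python
import math
import random


def make_rarity_subsets(labels, positive_counts, seed=0):
    """Build index subsets that keep every negative (0) instance and a
    random sample of positive (1) instances of each requested size.

    `positive_counts` is a hypothetical list of target sizes; the counts
    used in the paper are not given in the abstract.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    subsets = {}
    for count in positive_counts:
        if count > len(positives):
            raise ValueError("requested more positives than are available")
        kept = rng.sample(positives, count)  # sampling without replacement
        subsets[count] = sorted(negatives + kept)
    return subsets


def geometric_mean(y_true, y_pred):
    """Geometric mean of TPR and TNR (assumed definition, see lead-in)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Toy data: 20 positives among 1000 instances (illustrative only).
toy_labels = [1] * 20 + [0] * 980
toy_subsets = make_rarity_subsets(toy_labels, positive_counts=[20, 10, 5])
for count, indices in toy_subsets.items():
    print(count, "positives kept,", len(indices), "instances in subset")
```

Keeping all negative instances while shrinking only the positive class raises the imbalance ratio from one subset to the next, which is the effect the abstract describes. A model that predicts only the majority class gets a Geometric Mean of zero under this definition, since its true positive rate is zero.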
Related papers
50 in total
  • [1] Investigating class rarity in big data
    Hasanin, Tawfiq
    Khoshgoftaar, Taghi M.
    Leevy, Joffrey L.
    Bauder, Richard A.
    [J]. JOURNAL OF BIG DATA, 2020, 7 (01)
  • [2] An Empirical Study on Class Rarity in Big Data
    Bauder, Richard A.
    Khoshgoftaar, Taghi M.
    Hasanin, Tawfiq
    [J]. 2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2018, : 785 - 790
  • [3] RACE AND CLASS IN BIG DATA
    Harper, Logan J.
    Culver, Daniel A.
    Cozier, Yvette C.
    [J]. SARCOIDOSIS VASCULITIS AND DIFFUSE LUNG DISEASES, 2024, 41 (02)
  • [4] Investigating the Role of Big Data in Transportation Safety
    Das, Subasish
    Griffin, Greg P.
    [J]. TRANSPORTATION RESEARCH RECORD, 2020, 2674 (06) : 244 - 252
  • [5] So how big is big? Investigating the impact of class size on ratings in student evaluation
    Gannaway, Deanne
    Green, Teegan
    Mertova, Patricie
    [J]. ASSESSMENT & EVALUATION IN HIGHER EDUCATION, 2018, 43 (02) : 175 - 184
  • [6] Investigating the Impact of Big Data Analytics on Recruitment Practices
    Faizi, Rdouan
    El Fkihi, Sanaa
    [J]. VISION 2025: EDUCATION EXCELLENCE AND MANAGEMENT OF INNOVATIONS THROUGH SUSTAINABLE ECONOMIC COMPETITIVE ADVANTAGE, 2019, : 6164 - 6170
  • [7] Investigating the Incorporation of Big Data in Management Information Systems
    Staegemann, Daniel
    Feuersenger, Hannes
    Volk, Matthias
    Liedtke, Patrick
    Arndt, Hans-Knud
    Turowski, Klaus
    [J]. BUSINESS INFORMATION SYSTEMS WORKSHOPS, BIS 2021, 2022, 444 : 109 - 120
  • [8] Investigating the Adoption of Big Data Management in Healthcare in Jordan
    Bani-Salameh, Hani
    Al-Qawagneh, Mona
    Taamneh, Salah
    [J]. DATA, 2021, 6 (02) : 1 - 16
  • [9] Investigating the Perceived Innovation of the Big Data Technology in Healthcare
    Gallos, Parisis
    Minou, John
    Routsis, Fotios
    Mantas, John
    [J]. INFORMATICS EMPOWERS HEALTHCARE TRANSFORMATION, 2017, 238 : 151 - 153
  • [10] Severely imbalanced Big Data challenges: investigating data sampling approaches
    Hasanin, Tawfiq
    Khoshgoftaar, Taghi M.
    Leevy, Joffrey L.
    Bauder, Richard A.
    [J]. JOURNAL OF BIG DATA, 2019, 6 (01)