Threshold optimization and random undersampling for imbalanced credit card data

Cited by: 6
Authors
Leevy, Joffrey L. [1]
Johnson, Justin M. [1]
Hancock, John [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431 USA
Keywords
Output thresholding; Credit Card Fraud Detection Dataset; Random undersampling; Machine learning; PERFORMANCE; ALGORITHMS
DOI
10.1186/s40537-023-00738-z
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Output thresholding is well-suited for addressing class imbalance, since the technique does not increase dataset size, run the risk of discarding important instances, or modify an existing learner. Through the use of the Credit Card Fraud Detection Dataset, this study proposes a threshold optimization approach that factors in the constraint True Positive Rate (TPR) >= True Negative Rate (TNR). Our findings indicate that an increase of the Area Under the Precision-Recall Curve (AUPRC) score is associated with an improvement in threshold-based classification scores, while an increase of positive class prior probability causes optimal thresholds to increase. In addition, we discovered that best overall results for the selection of an optimal threshold are obtained without the use of Random Undersampling (RUS). Furthermore, with the exception of AUPRC, we established that the default threshold yields good performance scores at a balanced class ratio. Our evaluation of four threshold optimization techniques, eight threshold-dependent metrics, and two threshold-agnostic metrics defines the uniqueness of this research.
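The constrained threshold search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the objective being maximized (the geometric mean of TPR and TNR), the function names `rates` and `optimize_threshold`, and the toy labels and scores are all assumptions made for illustration. Only the constraint TPR >= TNR is taken from the abstract.

```python
from math import sqrt


def rates(y_true, scores, threshold):
    """Return (TPR, TNR) when instances with score >= threshold are labelled positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr


def optimize_threshold(y_true, scores):
    """Scan candidate thresholds and keep the best one that satisfies TPR >= TNR.

    Assumption: the objective is the geometric mean of TPR and TNR; the
    abstract does not name a single objective, so this choice is illustrative.
    Returns (threshold, g_mean, tpr, tnr), or None if no candidate is feasible.
    """
    best = None
    for t in sorted(set(scores)):  # every distinct score is a candidate threshold
        tpr, tnr = rates(y_true, scores, t)
        if tpr < tnr:  # constraint from the abstract: TPR >= TNR
            continue
        g_mean = sqrt(tpr * tnr)
        if best is None or g_mean > best[1]:
            best = (t, g_mean, tpr, tnr)
    return best


# Hypothetical toy data: 8 negatives, 2 positives (imbalanced).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.20, 0.40, 0.15, 0.60]

best = optimize_threshold(y_true, scores)
default_tpr, default_tnr = rates(y_true, scores, 0.5)
```

On this toy data the default threshold of 0.5 violates the constraint (its TPR is lower than its TNR), while the search settles on a lower threshold that recovers both positive instances. Because the search only post-processes the scores, the learner and the training set stay unchanged, which is the property of output thresholding the abstract highlights.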
Pages: 22
Related papers
50 in total
  • [21] Partial Undersampling of Imbalanced Data for Cyber Threats Detection
    Moniruzzaman, Md
    Bagirov, A. M.
    Gondal, Iqbal
    PROCEEDINGS OF THE AUSTRALASIAN COMPUTER SCIENCE WEEK MULTICONFERENCE (ACSW 2020), 2020
  • [22] Undersampling Instance Selection for Hybrid and Incomplete Imbalanced Data
    Camacho-Nieto, Oscar
    Yanez-Marquez, Cornelio
    Villuendas-Rey, Yenny
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2020, 26 (06) : 698 - 719
  • [23] An Iterative Undersampling of Extremely Imbalanced Data Using CSVM
    Lee, Jong Bum
    Lee, Jee-Hyong
    SEVENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2014), 2015, 9445
  • [24] A Membership Probability–Based Undersampling Algorithm for Imbalanced Data
    Gilseung Ahn
    You-Jin Park
    Sun Hur
    Journal of Classification, 2021, 38 : 2 - 15
  • [25] A Survey on GAN Techniques for Data Augmentation to Address the Imbalanced Data Issues in Credit Card Fraud Detection
    Strelcenia, Emilija
    Prakoonwit, Simant
    MACHINE LEARNING AND KNOWLEDGE EXTRACTION, 2023, 5 (01) : 304 - 329
  • [26] Radial-Based Undersampling for imbalanced data classification
    Koziarski, Michal
    PATTERN RECOGNITION, 2020, 102
  • [27] On Properties of Undersampling Bagging and Its Extensions for Imbalanced Data
    Stefanowski, Jerzy
    PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON COMPUTER RECOGNITION SYSTEMS, CORES 2015, 2016, 403 : 407 - 417
  • [28] Relevant information undersampling to support imbalanced data classification
    Hoyos-Osorio, J.
    Alvarez-Meza, A.
    Daza-Santacoloma, G.
    Orozco-Gutierrez, A.
    Castellanos-Dominguez, G.
    NEUROCOMPUTING, 2021, 436 : 136 - 146
  • [29] Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data
    Hancock J.T., III
    Khoshgoftaar T.M.
    SN Computer Science, 4 (5)
  • [30] A Generalized Optimization Embedded Framework of Undersampling Ensembles for Imbalanced Classification
    Guan, Hongjiao
    Zhang, Yingtao
    Ma, Bin
    Li, Jian
    Wang, Chunpeng
    2021 IEEE 8TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA), 2021