Robust Thresholding Strategies for Highly Imbalanced and Noisy Data

Cited by: 2
Authors
Johnson, Justin M. [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, Coll Engn & Comp Sci, Boca Raton, FL 33431 USA
Keywords
Class Noise; Class Imbalance; Output Thresholding; Big Data; Medicare; Fraud Detection; CLASSIFICATION;
DOI
10.1109/ICMLA52953.2021.00192
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many studies have shown that non-default decision thresholds are required to maximize classification performance on highly imbalanced data sets. Thresholding strategies include using a threshold equal to the prior probability of the positive class or identifying an optimal threshold on training data. It is not clear, however, how these thresholding strategies will generalize to imbalanced data sets that contain class label noise. When class noise is present, the positive class prior is influenced by the class label noise, and a threshold that is optimized on noisy training data may not generalize to test data. We employ four thresholding strategies: two thresholds that are optimized on training data and two thresholds that depend on the positive class prior. Threshold strategies are evaluated on a range of noise levels and noise distributions using the Random Forest, Multilayer Perceptron, and XGBoost learners. While all four thresholding strategies significantly outperform the default threshold with respect to the Geometric Mean (G-Mean), three of the four thresholds yield unstable true positive rates (TPR) and true negative rates (TNR) in the presence of class noise. Results show that setting the threshold equal to the prior probability of the noisy positive class consistently performs best according to G-Mean, TPR, and TNR. This is the first evaluation of thresholding strategies for imbalanced and noisy data, to the best of our knowledge, and our results contradict related works that have suggested optimizing thresholds on training data as the best approach.
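The two families of thresholds named in the abstract can be illustrated with a minimal sketch. It assumes binary labels coded 0/1 and classifier scores that are predicted positive-class probabilities. The function names (`g_mean`, `prior_threshold`, `optimal_gmean_threshold`) and the grid search are hypothetical illustrations; the abstract does not specify how the authors optimize thresholds on training data or exactly which four variants they use.

```python
import numpy as np

def g_mean(y_true, y_score, threshold):
    """Geometric mean of TPR and TNR at a given decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) >= threshold
    pos = y_true == 1
    neg = ~pos
    tpr = np.mean(y_pred[pos]) if pos.any() else 0.0
    tnr = np.mean(~y_pred[neg]) if neg.any() else 0.0
    return float(np.sqrt(tpr * tnr))

def prior_threshold(y_train):
    """Threshold equal to the positive-class prior of the training labels.

    If the labels contain class noise, this is the *noisy* prior.
    """
    return float(np.mean(np.asarray(y_train) == 1))

def optimal_gmean_threshold(y_train, train_scores, grid=None):
    """Assumed variant: grid-search the threshold that maximizes G-Mean
    on the training data (ties resolved toward the smallest threshold)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)
    scores = [g_mean(y_train, train_scores, t) for t in grid]
    return float(grid[int(np.argmax(scores))])

# Toy imbalanced example: 2 positives out of 10 samples.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
s = [0.05, 0.10, 0.10, 0.15, 0.18, 0.10, 0.05, 0.12, 0.30, 0.40]
for name, t in [("default", 0.5),
                ("prior", prior_threshold(y)),
                ("train-optimal", optimal_gmean_threshold(y, s))]:
    print(name, round(t, 3), round(g_mean(y, s, t), 3))
```

On the toy data, no score reaches the default threshold of 0.5, so every positive is missed, which is the failure mode that motivates non-default thresholds on imbalanced data. The prior-based threshold needs only the label counts, whereas the optimized threshold also depends on the training scores, so noisy labels can affect it through both inputs.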
Pages: 1182-1188
Number of pages: 7