Robust Thresholding Strategies for Highly Imbalanced and Noisy Data

Cited by: 2
Authors
Johnson, Justin M. [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, Coll Engn & Comp Sci, Boca Raton, FL 33431 USA
Keywords
Class Noise; Class Imbalance; Output Thresholding; Big Data; Medicare; Fraud Detection; Classification
DOI
10.1109/ICMLA52953.2021.00192
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Many studies have shown that non-default decision thresholds are required to maximize classification performance on highly imbalanced data sets. Thresholding strategies include using a threshold equal to the prior probability of the positive class or identifying an optimal threshold on training data. It is not clear, however, how these thresholding strategies will generalize to imbalanced data sets that contain class label noise. When class noise is present, the positive class prior is influenced by the class label noise, and a threshold that is optimized on noisy training data may not generalize to test data. We employ four thresholding strategies: two thresholds that are optimized on training data and two thresholds that depend on the positive class prior. Threshold strategies are evaluated on a range of noise levels and noise distributions using the Random Forest, Multilayer Perceptron, and XGBoost learners. While all four thresholding strategies significantly outperform the default threshold with respect to the Geometric Mean (G-Mean), three of the four thresholds yield unstable true positive rates (TPR) and true negative rates (TNR) in the presence of class noise. Results show that setting the threshold equal to the prior probability of the noisy positive class consistently performs best according to G-Mean, TPR, and TNR. This is the first evaluation of thresholding strategies for imbalanced and noisy data, to the best of our knowledge, and our results contradict related works that have suggested optimizing thresholds on training data as the best approach.
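The abstract describes two families of thresholds (one tied to the positive class prior, one optimized on training data) and scores them with G-Mean, TPR, and TNR. The sketch below illustrates one simple member of each family. It is not the authors' implementation: the abstract does not name the four strategies, so the function names, the toy labels and scores, and the choice of G-Mean as the optimization target are all assumptions made for illustration.

```python
import numpy as np

def rates(y_true, scores, threshold):
    """Return (TPR, TNR) when scores >= threshold are predicted positive."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores) >= threshold
    tpr = float(np.mean(pred[y_true]))     # positives predicted positive
    tnr = float(np.mean(~pred[~y_true]))   # negatives predicted negative
    return tpr, tnr

def g_mean(y_true, scores, threshold):
    """Geometric mean of TPR and TNR at the given threshold."""
    tpr, tnr = rates(y_true, scores, threshold)
    return float(np.sqrt(tpr * tnr))

def prior_threshold(y_train):
    """Threshold equal to the positive class prior of the (possibly noisy) training labels."""
    return float(np.mean(y_train))

def best_gmean_threshold(y_train, train_scores):
    """Hypothetical 'optimized' strategy: the observed score that maximizes G-Mean on training data."""
    candidates = np.unique(train_scores)
    values = [g_mean(y_train, train_scores, t) for t in candidates]
    return float(candidates[int(np.argmax(values))])

# Toy, made-up data: 8 negatives and 2 positives (positive prior = 0.2).
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.01, 0.02, 0.05, 0.05, 0.10, 0.10, 0.15, 0.30, 0.25, 0.60])

for name, t in [("default", 0.5),
                ("prior", prior_threshold(y_train)),
                ("train-optimal", best_gmean_threshold(y_train, scores))]:
    tpr, tnr = rates(y_train, scores, t)
    print(f"{name}: threshold={t:.2f} TPR={tpr:.3f} TNR={tnr:.3f} G-Mean={g_mean(y_train, scores, t):.3f}")
```

On this toy data the default threshold of 0.5 misses one of the two positives, while both alternative thresholds recover it. With noisy labels, `prior_threshold` would be computed from the noisy training labels, which corresponds to the noisy positive class prior that the abstract reports as the most stable choice.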
Pages: 1182-1188
Number of pages: 7