Robust Thresholding Strategies for Highly Imbalanced and Noisy Data

Cited by: 2
Authors
Johnson, Justin M. [1]
Khoshgoftaar, Taghi M. [1]
Affiliation
[1] Florida Atlantic Univ, Coll Engn & Comp Sci, Boca Raton, FL 33431 USA
Keywords
Class Noise; Class Imbalance; Output Thresholding; Big Data; Medicare; Fraud Detection; CLASSIFICATION;
DOI
10.1109/ICMLA52953.2021.00192
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many studies have shown that non-default decision thresholds are required to maximize classification performance on highly imbalanced data sets. Thresholding strategies include using a threshold equal to the prior probability of the positive class or identifying an optimal threshold on training data. It is not clear, however, how these thresholding strategies will generalize to imbalanced data sets that contain class label noise. When class noise is present, the positive class prior is influenced by the class label noise, and a threshold that is optimized on noisy training data may not generalize to test data. We employ four thresholding strategies: two thresholds that are optimized on training data and two thresholds that depend on the positive class prior. Threshold strategies are evaluated on a range of noise levels and noise distributions using the Random Forest, Multilayer Perceptron, and XGBoost learners. While all four thresholding strategies significantly outperform the default threshold with respect to the Geometric Mean (G-Mean), three of the four thresholds yield unstable true positive rates (TPR) and true negative rates (TNR) in the presence of class noise. Results show that setting the threshold equal to the prior probability of the noisy positive class consistently performs best according to G-Mean, TPR, and TNR. This is the first evaluation of thresholding strategies for imbalanced and noisy data, to the best of our knowledge, and our results contradict related works that have suggested optimizing thresholds on training data as the best approach.
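To make the two families of strategies in the abstract concrete, here is a minimal sketch of one threshold from each family: a threshold set to the positive class prior observed in the (possibly noisy) training labels, and a threshold chosen to maximize G-Mean on training data, where G-Mean is the square root of TPR times TNR. The function names and the candidate-search procedure are illustrative assumptions; the abstract does not spell out the exact definitions of the four thresholds the authors evaluate, so this should not be read as their implementation.

```python
import numpy as np


def tpr_tnr(y_true, scores, threshold):
    """True positive rate and true negative rate at a given decision threshold."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores) >= threshold
    tp = np.sum(pred & y_true)
    tn = np.sum(~pred & ~y_true)
    # Guard against division by zero when a class is absent.
    tpr = tp / max(int(y_true.sum()), 1)
    tnr = tn / max(int((~y_true).sum()), 1)
    return float(tpr), float(tnr)


def g_mean(y_true, scores, threshold):
    """Geometric mean of TPR and TNR."""
    tpr, tnr = tpr_tnr(y_true, scores, threshold)
    return float(np.sqrt(tpr * tnr))


def prior_threshold(y_train):
    """Hypothetical helper: threshold equal to the positive class prior,
    estimated as the fraction of positive labels in the training set.
    With label noise, this is the prior of the *noisy* positive class."""
    return float(np.mean(y_train))


def optimal_gmean_threshold(y_train, train_scores):
    """Hypothetical helper: pick the training-set score that maximizes G-Mean.
    Assumption: candidates are the unique predicted scores; ties go to the
    smallest candidate."""
    candidates = np.unique(train_scores)
    gmeans = [g_mean(y_train, train_scores, t) for t in candidates]
    return float(candidates[int(np.argmax(gmeans))])


# Toy illustration with made-up labels and scores (not data from the paper).
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
t_prior = prior_threshold(y_toy)
t_opt = optimal_gmean_threshold(y_toy, s_toy)
```

Either threshold would then be applied to test-set scores as `scores >= threshold`. The prior-based threshold needs only the label counts, whereas the optimized threshold depends on the model's scores for the noisy training labels, which is the dependence the abstract flags as a possible source of poor generalization.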
Pages: 1182 - 1188
Page count: 7
Related papers
50 in total
  • [21] Obtaining Robust Models from Imbalanced Data
    Wang, Wentao
    [J]. WSDM'22: PROCEEDINGS OF THE FIFTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2022, : 1555 - 1556
  • [22] Robust Optimization for Multilingual Translation with Imbalanced Data
    Li, Xian
    Gong, Hongyu
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [23] Improved randomized learning algorithms for imbalanced and noisy educational data classification
    Ming Li
    Changqin Huang
    Dianhui Wang
    Qintai Hu
    Jia Zhu
    Yong Tang
    [J]. Computing, 2019, 101 : 571 - 585
  • [24] Improved randomized learning algorithms for imbalanced and noisy educational data classification
    Li, Ming
    Huang, Changqin
    Wang, Dianhui
    Hu, Qintai
    Zhu, Jia
    Tang, Yong
    [J]. COMPUTING, 2019, 101 (06) : 571 - 585
  • [25] Dealing with Small, Noisy and Imbalanced Data Machine Learning or Manual Grammars?
    Przepiorkowski, Adam
    Marcinczuk, Michal
    Degorski, Lukasz
    [J]. TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2008, 5246 : 169 - +
  • [26] Knowledge discovery from noisy imbalanced and incomplete binary class data
    Puri, Arjun
    Gupta, Manoj Kumar
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2021, 181
  • [27] Imbalanced Multiple Noisy Labeling
    Zhang, Jing
    Wu, Xindong
    Sheng, Victor S.
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (02) : 489 - 503
  • [28] Robust Rescaled Hinge Loss Twin Support Vector Machine for imbalanced Noisy Classification
    Huang, Ling-Wei
    Shao, Yuan-Hai
    Zhang, Jun
    Zhao, Yu-Ting
    Teng, Jia-Ying
    [J]. IEEE ACCESS, 2019, 7 : 65390 - 65404
  • [29] Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets
    Li, Zhong
    Pan, Minxue
    Pei, Yu
    Zhang, Tian
    Wang, Linzhang
    Li, Xuandong
    [J]. PROCEEDINGS OF THE 37TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE 2022, 2022,
  • [30] Data reduction techniques for highly imbalanced medicare Big Data
    Hancock, John T.
    Wang, Huanjing
    Khoshgoftaar, Taghi M.
    Liang, Qianxin
    [J]. JOURNAL OF BIG DATA, 2024, 11 (01)