Robust Thresholding Strategies for Highly Imbalanced and Noisy Data

Cited by: 2

Authors:

Johnson, Justin M. [1]

Khoshgoftaar, Taghi M. [1]

Affiliations:

[1] Florida Atlantic Univ, Coll Engn & Comp Sci, Boca Raton, FL 33431 USA

Source:

20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021) | 2021

Keywords:

Class Noise; Class Imbalance; Output Thresholding; Big Data; Medicare; Fraud Detection; CLASSIFICATION;

DOI:

10.1109/ICMLA52953.2021.00192

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many studies have shown that non-default decision thresholds are required to maximize classification performance on highly imbalanced data sets. Thresholding strategies include using a threshold equal to the prior probability of the positive class or identifying an optimal threshold on training data. It is not clear, however, how these thresholding strategies will generalize to imbalanced data sets that contain class label noise. When class noise is present, the positive class prior is influenced by the class label noise, and a threshold that is optimized on noisy training data may not generalize to test data. We employ four thresholding strategies: two thresholds that are optimized on training data and two thresholds that depend on the positive class prior. Threshold strategies are evaluated on a range of noise levels and noise distributions using the Random Forest, Multilayer Perceptron, and XGBoost learners. While all four thresholding strategies significantly outperform the default threshold with respect to the Geometric Mean (G-Mean), three of the four thresholds yield unstable true positive rates (TPR) and true negative rates (TNR) in the presence of class noise. Results show that setting the threshold equal to the prior probability of the noisy positive class consistently performs best according to G-Mean, TPR, and TNR. This is the first evaluation of thresholding strategies for imbalanced and noisy data, to the best of our knowledge, and our results contradict related works that have suggested optimizing thresholds on training data as the best approach.
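
The two families of thresholds named in the abstract can be illustrated with a minimal sketch. The abstract does not state which metric the training-optimized thresholds maximize, nor how the second prior-based threshold is defined, so the G-Mean search and every function name below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def tpr_tnr(y_true, y_pred):
    """Return (true positive rate, true negative rate) for binary labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    tpr = (y_true & y_pred).sum() / n_pos if n_pos else 0.0
    tnr = (~y_true & ~y_pred).sum() / n_neg if n_neg else 0.0
    return float(tpr), float(tnr)

def g_mean(y_true, y_pred):
    """Geometric mean of TPR and TNR."""
    tpr, tnr = tpr_tnr(y_true, y_pred)
    return float(np.sqrt(tpr * tnr))

def prior_threshold(y_train):
    """Threshold equal to the positive class prior of the (possibly noisy) labels."""
    return float(np.mean(y_train))

def optimal_gmean_threshold(y_train, scores):
    """Assumed variant: pick the training score that maximizes G-Mean."""
    scores = np.asarray(scores, dtype=float)
    candidates = np.unique(scores)  # sorted unique scores as candidate thresholds
    gms = [g_mean(y_train, scores >= t) for t in candidates]
    return float(candidates[int(np.argmax(gms))])

def apply_threshold(scores, threshold):
    """Label an instance positive when its score reaches the threshold."""
    return (np.asarray(scores, dtype=float) >= threshold).astype(int)

# Hypothetical toy data: 2 positives among 10 instances.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.01, 0.02, 0.05, 0.10, 0.15, 0.05, 0.02, 0.30, 0.25, 0.40])

t_prior = prior_threshold(y)
t_opt = optimal_gmean_threshold(y, scores)
gm_default = g_mean(y, apply_threshold(scores, 0.5))
gm_prior = g_mean(y, apply_threshold(scores, t_prior))
```

On this toy data no score reaches the default threshold of 0.5, so every instance is labeled negative and G-Mean collapses to zero, whereas the prior-based threshold recovers both positives. With label noise, `np.mean(y_train)` is computed from the noisy labels, which corresponds to the noisy positive class prior the abstract refers to.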


Pages: 1182-1188

Page count: 7

50 items in total

[21] Obtaining Robust Models from Imbalanced Data
Wang, Wentao
[J]. WSDM'22: PROCEEDINGS OF THE FIFTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2022, : 1555 - 1556
[22] Robust Optimization for Multilingual Translation with Imbalanced Data
Li, Xian
Gong, Hongyu
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[23] Improved randomized learning algorithms for imbalanced and noisy educational data classification
Ming Li
Changqin Huang
Dianhui Wang
Qintai Hu
Jia Zhu
Yong Tang
[J]. Computing, 2019, 101 : 571 - 585
[24] Improved randomized learning algorithms for imbalanced and noisy educational data classification
Li, Ming
Huang, Changqin
Wang, Dianhui
Hu, Qintai
Zhu, Jia
Tang, Yong
[J]. COMPUTING, 2019, 101 (06) : 571 - 585
[25] Dealing with Small, Noisy and Imbalanced Data Machine Learning or Manual Grammars?
Przepiorkowski, Adam
Marcinczuk, Michal
Degorski, Lukasz
[J]. TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2008, 5246 : 169 - +
[26] Knowledge discovery from noisy imbalanced and incomplete binary class data
Puri, Arjun
Gupta, Manoj Kumar
[J]. EXPERT SYSTEMS WITH APPLICATIONS, 2021, 181
[27] Imbalanced Multiple Noisy Labeling
Zhang, Jing
Wu, Xindong
Sheng, Victor S.
[J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (02) : 489 - 503
[28] Robust Rescaled Hinge Loss Twin Support Vector Machine for imbalanced Noisy Classification
Huang, Ling-Wei
Shao, Yuan-Hai
Zhang, Jun
Zhao, Yu-Ting
Teng, Jia-Ying
[J]. IEEE ACCESS, 2019, 7 : 65390 - 65404
[29] Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets
Li, Zhong
Pan, Minxue
Pei, Yu
Zhang, Tian
Wang, Linzhang
Li, Xuandong
[J]. PROCEEDINGS OF THE 37TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE 2022, 2022,
[30] Data reduction techniques for highly imbalanced medicare Big Data
Hancock, John T.
Wang, Huanjing
Khoshgoftaar, Taghi M.
Liang, Qianxin
[J]. JOURNAL OF BIG DATA, 2024, 11 (01)
