Robust Thresholding Strategies for Highly Imbalanced and Noisy Data

Cited by: 2

Authors:

Johnson, Justin M. [1]

Khoshgoftaar, Taghi M. [1]

Affiliations:

[1] Florida Atlantic Univ, Coll Engn & Comp Sci, Boca Raton, FL 33431 USA

Source:

20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021) | 2021

Keywords:

Class Noise; Class Imbalance; Output Thresholding; Big Data; Medicare; Fraud Detection; CLASSIFICATION;

DOI:

10.1109/ICMLA52953.2021.00192

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many studies have shown that non-default decision thresholds are required to maximize classification performance on highly imbalanced data sets. Thresholding strategies include using a threshold equal to the prior probability of the positive class or identifying an optimal threshold on training data. It is not clear, however, how these thresholding strategies will generalize to imbalanced data sets that contain class label noise. When class noise is present, the positive class prior is influenced by the class label noise, and a threshold that is optimized on noisy training data may not generalize to test data. We employ four thresholding strategies: two thresholds that are optimized on training data and two thresholds that depend on the positive class prior. Threshold strategies are evaluated on a range of noise levels and noise distributions using the Random Forest, Multilayer Perceptron, and XGBoost learners. While all four thresholding strategies significantly outperform the default threshold with respect to the Geometric Mean (G-Mean), three of the four thresholds yield unstable true positive rates (TPR) and true negative rates (TNR) in the presence of class noise. Results show that setting the threshold equal to the prior probability of the noisy positive class consistently performs best according to G-Mean, TPR, and TNR. This is the first evaluation of thresholding strategies for imbalanced and noisy data, to the best of our knowledge, and our results contradict related works that have suggested optimizing thresholds on training data as the best approach.
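The two families of thresholds named in the abstract can be illustrated with a minimal sketch. The abstract does not say which four strategies were used or which criterion the training-data thresholds optimize, so everything here is an assumption made for illustration: the function names, the grid of candidate thresholds, and the choice of G-Mean as the optimization target. G-Mean is taken in its usual sense, the square root of TPR times TNR.

```python
import math

def rates(y_true, scores, threshold):
    """Return (TPR, TNR) when scores strictly above `threshold` are labeled positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s > threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr

def g_mean(y_true, scores, threshold):
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr, tnr = rates(y_true, scores, threshold)
    return math.sqrt(tpr * tnr)

def prior_threshold(y_train):
    """Threshold equal to the positive class prior, estimated from the
    (possibly noisy) training labels as the fraction of positive labels."""
    return sum(y_train) / len(y_train)

def optimal_threshold(y_train, train_scores, candidates=None):
    """Threshold that maximizes G-Mean on training data.
    Assumption: a plain grid search over 0.00, 0.01, ..., 0.99."""
    if candidates is None:
        candidates = [i / 100 for i in range(100)]
    return max(candidates, key=lambda t: g_mean(y_train, train_scores, t))

# Hypothetical toy data: 8 negatives, 2 positives, with made-up model scores.
y_train = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
train_scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.35, 0.60]

for name, t in [("default", 0.5),
                ("prior", prior_threshold(y_train)),
                ("optimal", optimal_threshold(y_train, train_scores))]:
    print(name, t, round(g_mean(y_train, train_scores, t), 3))
```

On this toy data both non-default thresholds give a higher G-Mean than 0.5. The sketch does not reproduce the paper's noise experiments; it only shows how each kind of threshold is computed.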

Pages: 1182-1188

Number of pages: 7

50 items in total

[1] A fuzzy classifier for imbalanced and noisy data
Visa, S
Ralescu, A
[J]. 2004 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS, VOLS 1-3, PROCEEDINGS, 2004, : 1727 - 1732
[2] Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data
Zhao, Na
Lee, Gim Hee
[J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 15, 2024, : 16989 - 16997
[3] Software quality classification with imbalanced and noisy data
Folleco, Andres
Khoshgoftaar, Taghi M.
Van Hulse, Jason
[J]. THIRTEENTH ISSAT INTERNATIONAL CONFERENCE ON RELIABILITY AND QUALITY IN DESIGN, PROCEEDINGS, 2007, : 191 - +
[4] An exploration of learning when data is noisy and imbalanced
Van Hulse, Jason
Khoshgoftaar, Taghi M.
Napolitano, Amri
[J]. INTELLIGENT DATA ANALYSIS, 2011, 15 (02) : 215 - 236
[5] Knowledge discovery from imbalanced and noisy data
Van Hulse, Jason
Khoshgoftaar, Taghi
[J]. DATA & KNOWLEDGE ENGINEERING, 2009, 68 (12) : 1513 - 1542
[6] Boosted RVM algorithm for imbalanced and noisy data
Qin, Wangchen
Tong, Mi
Liu, Fang
Qi, Quan
[J]. 2018 5TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE 2018), 2018, : 151 - 155
[7] Output Thresholding for Ensemble Learners and Imbalanced Big Data
Johnson, Justin M.
Khoshgoftaar, Taghi M.
[J]. 2021 IEEE 33RD INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2021), 2021, : 1449 - 1454
[8] Reconstruction of bandlimited signals from noisy data by thresholding
Nguyen, VL
Pawlak, M
[J]. 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL VI, PROCEEDINGS: SIGNAL PROCESSING THEORY AND METHODS, 2003, : 169 - 172
[9] Comparing Boosting and Bagging Techniques With Noisy and Imbalanced Data
Khoshgoftaar, Taghi M.
Van Hulse, Jason
Napolitano, Amri
[J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART A-SYSTEMS AND HUMANS, 2011, 41 (03) : 552 - 568
[10] Kernel Logistic Regression: A Robust Weighting for Imbalanced Classes with Noisy Labels
Byrnes, Paul G.
DiazDelaO, Francisco A.
[J]. 2018 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND DATA ENGINEERING (ICMLDE 2018), 2018, : 30 - 34
