Imbalanced data issues in machine learning classifiers: a case study

Cited by: 0

Authors:

Gong, Mingxing [1]

Affiliations:

[1] Univ Delaware, Alfred Lerner Coll Business, Inst Financial Serv Analyt, Purnell Hall, Newark, DE 19716 USA

Source:

JOURNAL OF OPERATIONAL RISK | 2022 / Vol. 17 / No. 04

Keywords:

machine learning; imbalanced data; fraud risk; performance measures; cost-sensitive learning; classification

DOI:

10.21314/JOP.2022.027

Chinese Library Classification (CLC) number:

F8 [Public Finance and Finance]

Discipline classification code:

0202

Abstract:

Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account in their model development and validation. In addition, resampling is a common technique to address imbalanced data issues when building traditional logistic regression models. However, there has been no specific discussion regarding the resampling ratio used to rebalance the data or how the issue of imbalance impacts different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods in dealing with imbalanced data issues and provide best practice in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed in this paper can apply to general machine learning classifiers in applications with imbalanced data issues, by using a case study in credit card fraud detection this paper calls practitioners' attention to the imbalanced data problems therein, where class imbalance is often mistreated and lacks theoretical discussion.

Pages: 17-36

Page count: 20
