Imbalanced data issues in machine learning classifiers: a case study

Times cited: 0
Author
Gong, Mingxing [1 ]
Affiliation
[1] Univ Delaware, Alfred Lerner Coll Business, Inst Financial Serv Analyt, Purnell Hall, Newark, DE 19716 USA
Source
JOURNAL OF OPERATIONAL RISK | 2022 / Volume 17 / Issue 04
Keywords
machine learning; imbalanced data; fraud risk; performance measures; cost sensitive learning; CLASSIFICATION;
DOI
10.21314/JOP.2022.027
Chinese Library Classification number
F8 [Public finance, finance];
Discipline classification code
0202 ;
Abstract
Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account in their model development and validation. In addition, resampling is a common technique to address imbalanced data issues when building traditional logistic regression models. However, there has been no specific discussion regarding the resampling ratio used to rebalance the data or how the issue of imbalance impacts different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods in dealing with imbalanced data issues and provide best practice in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed in this paper can apply to general machine learning classifiers in applications with imbalanced data issues, by using a case study in credit card fraud detection this paper calls practitioners' attention to the imbalanced data problems therein, where class imbalance is often mistreated and lacks theoretical discussion.
Pages: 17-36
Number of pages: 20
Related papers
50 in total
  • [21] Evidential Combination of Classifiers for Imbalanced Data
    Niu, Jiawei
    Liu, Zhunga
    Lu, Yao
    Wen, Zaidao
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2022, 52 (12) : 7642 - 7653
  • [22] A study in machine learning from imbalanced data for sentence boundary detection in speech
    Liu, Yang
    Chawla, Nitesh V.
    Harper, Mary R.
    Shriberg, Elizabeth
    Stolcke, Andreas
    COMPUTER SPEECH AND LANGUAGE, 2006, 20 (04) : 468 - 494
  • [23] A machine learning case study to predict rare clinical event of interest: imbalanced data, interpretability, and practical considerations
    Zhong, Sheng
    Zhang, Jane
    Jiao, Jenny
    Zhu, Hongjian
    Xing, Yunzhao
    Wang, Li
    JOURNAL OF BIOPHARMACEUTICAL STATISTICS, 2024,
  • [24] Adversarial Approaches to Tackle Imbalanced Data in Machine Learning
    Ayoub, Shahnawaz
    Gulzar, Yonis
    Rustamov, Jaloliddin
    Jabbari, Abdoh
    Reegu, Faheem Ahmad
    Turaev, Sherzod
    SUSTAINABILITY, 2023, 15 (09)
  • [25] Evolutionary Online Machine Learning from Imbalanced Data
    Stein, Anthony
    2016 IEEE 1ST INTERNATIONAL WORKSHOPS ON FOUNDATIONS AND APPLICATIONS OF SELF* SYSTEMS (FAS*W), 2016 : 281 - 286
  • [26] A comparative analysis of machine learning techniques for imbalanced data
    Mrad, Ali Ben
    Lahiani, Amine
    Mefteh-Wali, Salma
    Mselmi, Nada
    ANNALS OF OPERATIONS RESEARCH, 2024,
  • [27] An Improved Extreme Learning Machine for Imbalanced Data Classification
    Zhang, Xiaopeng
    Qin, Liangxi
    IEEE ACCESS, 2022, 10 : 8634 - 8642
  • [28] A machine learning method for incomplete and imbalanced medical data
    Salman, Issam
    Vomlel, Jiri
    PROCEEDINGS OF THE 20TH CZECH-JAPAN SEMINAR ON DATA ANALYSIS AND DECISION MAKING UNDER UNCERTAINTY, 2017 : 188 - 195
  • [29] High dimensional classifiers in the imbalanced case
    Bak, Britta Anker
    Jensen, Jens Ledet
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2016, 98 : 46 - 59
  • [30] Types of minority class examples and their influence on learning classifiers from imbalanced data
    Napierala, Krystyna
    Stefanowski, Jerzy
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2016, 46 (03) : 563 - 597