Imbalanced data issues in machine learning classifiers: a case study

Author
Gong, Mingxing [1 ]
Affiliation
[1] Univ Delaware, Alfred Lerner Coll Business, Inst Financial Serv Analyt, Purnell Hall, Newark, DE 19716 USA
Source
JOURNAL OF OPERATIONAL RISK | 2022 / Vol. 17 / Issue 04
Keywords
machine learning; imbalanced data; fraud risk; performance measures; cost sensitive learning; CLASSIFICATION
DOI
10.21314/JOP.2022.027
Chinese Library Classification number
F8 [Public Finance, Finance]
Discipline classification code
0202
Abstract
Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account during model development and validation. Resampling is a common technique for addressing imbalanced data when building traditional logistic regression models. However, there has been no specific discussion of the resampling ratio used to rebalance the data, or of how imbalance affects different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods for dealing with imbalanced data, and provide best practices in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed apply to general machine learning classifiers in applications with imbalanced data, the paper uses a case study in credit card fraud detection to call practitioners' attention to the imbalanced data problems therein, where class imbalance is often mishandled and has received little theoretical discussion.
Pages: 17-36
Page count: 20