Imbalanced data issues in machine learning classifiers: a case study

Cited by: 0
Authors
Gong, Mingxing [1]
Affiliation
[1] Univ Delaware, Alfred Lerner Coll Business, Inst Financial Serv Analyt, Purnell Hall, Newark, DE 19716 USA
Source
JOURNAL OF OPERATIONAL RISK | 2022 / Vol. 17 / Issue 04
Keywords
machine learning; imbalanced data; fraud risk; performance measures; cost sensitive learning; CLASSIFICATION;
DOI
10.21314/JOP.2022.027
Chinese Library Classification number
F8 [Public Finance, Finance];
Discipline classification code
0202 ;
Abstract
Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account in their model development and validation. In addition, resampling is a common technique to address imbalanced data issues when building traditional logistic regression models. However, there has been no specific discussion regarding the resampling ratio used to rebalance the data or how the issue of imbalance impacts different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods in dealing with imbalanced data issues and provide best practice in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed in this paper can apply to general machine learning classifiers in applications with imbalanced data issues, by using a case study in credit card fraud detection this paper calls practitioners' attention to the imbalanced data problems therein, where class imbalance is often mistreated and lacks theoretical discussion.
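As a rough illustration of two ideas the abstract names (performance measures and the resampling ratio), the following minimal sketch uses synthetic labels only. It is not the paper's method or data: the 1% fraud rate, the 4:1 majority-to-minority ratio, and the helper names `confusion_counts`, `accuracy`, `recall`, and `undersample_majority` are all assumptions made for illustration.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Share of true minority cases that were detected (0.0 if there are none)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def undersample_majority(labels, ratio, seed=0):
    """Return sorted row indices that keep every minority row and at most
    ratio * n_minority randomly chosen majority rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), int(ratio * len(minority)))
    kept = rng.sample(majority, n_keep)
    return sorted(minority + kept)


# Hypothetical data: 10 fraud cases among 1000 transactions.
y_true = [1] * 10 + [0] * 990

# A degenerate classifier that never flags fraud.
y_pred_all_negative = [0] * 1000

acc = accuracy(y_true, y_pred_all_negative)  # high, despite missing every fraud case
rec = recall(y_true, y_pred_all_negative)    # no fraud case is detected

# Rebalance to an assumed 4:1 majority-to-minority ratio.
idx = undersample_majority(y_true, ratio=4)
resampled = [y_true[i] for i in idx]

print(f"accuracy={acc:.2f} recall={rec:.2f} resampled_size={len(resampled)}")
```

The sketch shows why accuracy alone can mislead on imbalanced data: the classifier that never predicts fraud still scores well on accuracy while detecting nothing. The `ratio` argument is the kind of resampling choice the abstract says has received little specific discussion; the value used here is arbitrary.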
Pages: 17-36
Page count: 20
Related papers
50 in total
  • [1] Machine-learning classifiers for imbalanced tornado data
    Trafalis T.B.
    Adrianto I.
    Richman M.B.
    Lakshmivarahan S.
    [J]. Computational Management Science, 2014, 11 (4) : 403 - 418
  • [2] Performance of Machine Learning Classifiers for Malware Detection Over Imbalanced Data
    Morillo, Paulina
    Bahamonde, Diego
    Tapia, Wilian
    [J]. INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 1, INTELLISYS 2023, 2024, 822 : 496 - 507
  • [3] Learning classifiers from imbalanced data based on biased minimax probability machine
    Huang, KZ
    Yang, HQ
    King, I
    Lyu, MR
    [J]. PROCEEDINGS OF THE 2004 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 2, 2004, : 558 - 563
  • [4] Parallel classifiers ensemble with hierarchical machine learning for imbalanced classes
    Zhang, Yun
    Luo, Bing
    [J]. PROCEEDINGS OF 2008 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2008, : 94 - 99
  • [5] Machine learning for mining imbalanced data
    Arafat, Md. Yasir
    Hoque, Sabera
    Xu, Shuxiang
    Farid, Dewan Md
    [J]. IAENG International Journal of Computer Science, 2019, 46 (02) : 332 - 348
  • [6] Comparative Study of Various Machine Learning Classifiers on Medical Data
    Karankar, Nilima
    Shukla, Pragya
    Agrawal, Niyati
    [J]. 2017 7TH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORK TECHNOLOGIES (CSNT), 2017, : 267 - 271
  • [7] Active Learning with Abstaining Classifiers for Imbalanced Drifting Data Streams
    Korycki, Lukasz
    Cano, Alberto
    Krawczyk, Bartosz
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 2334 - 2343
  • [8] Fuzzy prototype selection-based classifiers for imbalanced data. Case study
    Rodriguez Alvarez, Yanela
    Garcia Lorenzo, Maria Matilde
    Caballero Mota, Yaile
    Filiberto Cabrera, Yaima
    Garcia Hilarion, Isabel M.
    Montes de Oca, Daniela Machado
    Bello Perez, Rafael
    [J]. PATTERN RECOGNITION LETTERS, 2022, 163 : 183 - 190
  • [9] Imbalanced data preprocessing techniques for machine learning: a systematic mapping study
    de Vargas, Vitor Werner
    Schneider Aranda, Jorge Arthur
    Costa, Ricardo dos Santos
    da Silva Pereira, Paulo Ricardo
    Victoria Barbosa, Jorge Luis
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2023, 65 (01) : 31 - 57
  • [10] Imbalanced Data Problem in Machine Learning: A Review
    Altalhan, Manahel
    Algarni, Abdulmohsen
    Turki-Hadj Alouane, Monia
    [J]. IEEE Access, 2025, 13 : 13686 - 13699