Imbalanced data issues in machine learning classifiers: a case study

被引:0
|
作者
Gong, Mingxing [1 ]
机构
[1] Univ Delaware, Alfred Lerner Coll Business, Inst Financial Serv Analyt, Purnell Hall, Newark, DE 19716 USA
来源
JOURNAL OF OPERATIONAL RISK | 2022年 / 17卷 / 04期
关键词
machine learning; imbalanced data; fraud risk; performance measures; cost sensitive learning; CLASSIFICATION;
D O I
10.21314/JOP.2022.027
中图分类号
F8 [财政、金融];
学科分类号
0202 ;
摘要
Machine learning classifiers are widely used in financial applications. Due to the nature of certain classification problems, special care should be taken when dealing with imbalanced data. In practice, many model developers and validators fail to take this into account in their model development and validation. In addition, resampling is a common technique to address imbalanced data issues when building traditional logistic regression models. However, there has been no specific discussion regarding the resampling ratio used to rebalance the data or how the issue of imbalance impacts different kinds of machine learning classifiers, especially the more advanced ones. This paper aims to outline the special characteristics of the classifiers, compare different methods in dealing with imbalanced data issues and provide best practice in model development, evaluation and validation to avoid common pitfalls. Although the methods discussed in this paper can apply to general machine learning classifiers in applications with imbalanced data issues, by using a case study in credit card fraud detection this paper calls practitioners' attention to the imbalanced data problems therein, where class imbalance is often mistreated and lacks theoretical discussion.
引用
收藏
页码:17 / 36
页数:20
相关论文
共 50 条
  • [31] Types of minority class examples and their influence on learning classifiers from imbalanced data
    Krystyna Napierala
    Jerzy Stefanowski
    Journal of Intelligent Information Systems, 2016, 46 : 563 - 597
  • [32] State-of-the-art Machine Learning Classifiers: A Comparative Study on health data
    Maqsood, Sadia
    Malik, Muhammad Sheraz Arshad
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2019, 19 (09): : 96 - 102
  • [33] Distributed and Weighted Extreme Learning Machine for Imbalanced Big Data Learning
    Zhiqiong Wang
    Junchang Xin
    Hongxu Yang
    Shuo Tian
    Ge Yu
    Chenren Xu
    Yudong Yao
    Tsinghua Science and Technology, 2017, 22 (02) : 160 - 173
  • [34] Distributed Weighted Extreme Learning Machine for Big Imbalanced Data Learning
    Wang, Zhiqiong
    Xin, Junchang
    Tian, Shuo
    Yu, Ge
    PROCEEDINGS OF ELM-2015, VOL 1: THEORY, ALGORITHMS AND APPLICATIONS (I), 2016, 6 : 319 - 332
  • [35] Distributed and Weighted Extreme Learning Machine for Imbalanced Big Data Learning
    Wang, Zhiqiong
    Xin, Junchang
    Yang, Hongxu
    Tian, Shuo
    Yu, Ge
    Xu, Chenren
    Yao, Yudong
    TSINGHUA SCIENCE AND TECHNOLOGY, 2017, 22 (02) : 160 - 173
  • [36] A Study of Machine Learning Classifiers for Spam Detection
    Trivedi, Shrawan Kumar
    2016 4TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL AND BUSINESS INTELLIGENCE (ISCBI), 2016, : 176 - 180
  • [37] The effect of imbalanced data class distribution on fuzzy classifiers - Experimental study
    Visa, S
    Ralescu, A
    FUZZ-IEEE 2005: PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS: BIGGEST LITTLE CONFERENCE IN THE WORLD, 2005, : 749 - 754
  • [38] Investigation of Imbalanced Sentiment Analysis in Voice Data: A Comparative Study of Machine Learning Algorithms
    Shah, Viraj Nishchal
    Shah, Deep Rahul
    Shetty, Mayank Umesh
    Krishnan, Deepa
    Ravi, Vinayakumar
    Singh, Swapnil
    EAI Endorsed Transactions on Scalable Information Systems, 2024, 11 (06) : 1 - 12
  • [39] Collective of Base Classifiers for Mining Imbalanced Data
    Jedrzejowicz, Joanna
    Jedrzejowicz, Piotr
    COMPUTATIONAL SCIENCE, ICCS 2022, PT II, 2022, : 571 - 585
  • [40] Balanced Neighborhood Classifiers for Imbalanced Data Sets
    Zhu, Shunzhi
    Ma, Ying
    Pan, Weiwei
    Zhu, Xiatian
    Luo, Guangchun
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2014, E97D (12): : 3226 - 3229