Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Cited by: 28
Authors
de Vargas, Vitor Werner [1]
Schneider Aranda, Jorge Arthur [1]
Costa, Ricardo dos Santos [2]
da Silva Pereira, Paulo Ricardo [2]
Victoria Barbosa, Jorge Luis [1, 2]
Affiliations
[1] Univ Vale Rio dos Sinos, Appl Comp Grad Program, BR-93022750 Sao Leopoldo, RS, Brazil
[2] Univ Vale Rio dos Sinos, Elect Engn Grad Program, BR-93022750 Sao Leopoldo, RS, Brazil
Keywords
Imbalanced data; Preprocessing techniques; Sampling; Machine learning; Systematic mapping study; PERFORMANCE; PREDICTION
DOI
10.1007/s10115-022-01772-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Machine Learning (ML) algorithms have been increasingly replacing people in several application domains, most of which suffer from data imbalance. To address this problem, published studies implement data preprocessing techniques, cost-sensitive learning, and ensemble learning. These solutions reduce the bias towards the majority sample that naturally occurs in ML. This study uses a systematic mapping methodology to assess 9927 papers related to sampling techniques for ML in imbalanced data applications from 7 digital libraries. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. As a result of a thorough quantitative analysis of these papers, this study proposes two taxonomies that illustrate sampling techniques and ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models have the best performance, with potentially better results through hybrid sampling techniques. Finally, none of the 35 works apply simulation-based synthetic oversampling, indicating a path for future preprocessing solutions.
Pages: 31-57
Page count: 27
Related papers
50 in total
  • [1] Imbalanced data preprocessing techniques for machine learning: a systematic mapping study
    Vitor Werner de Vargas
    Jorge Arthur Schneider Aranda
    Ricardo dos Santos Costa
    Paulo Ricardo da Silva Pereira
    Jorge Luis Victória Barbosa
    [J]. Knowledge and Information Systems, 2023, 65 : 31 - 57
  • [2] Systematic literature review of preprocessing techniques for imbalanced data
    Felix, Ebubeogu Amarachukwu
    Lee, Sai Peck
    [J]. IET SOFTWARE, 2019, 13 (06) : 479 - 496
  • [3] A comparative analysis of machine learning techniques for imbalanced data
    Mrad, Ali Ben
    Lahiani, Amine
    Mefteh-Wali, Salma
    Mselmi, Nada
    [J]. ANNALS OF OPERATIONS RESEARCH, 2024
  • [4] Machine Learning Techniques in Optical Networks: A Systematic Mapping Study
    Villa, Genesis
    Tipantuna, Christian
    Guaman, Danny S.
    Arevalo, German V.
    Arguero, Berenice
    [J]. IEEE ACCESS, 2023, 11 : 98714 - 98750
  • [5] Machine Learning Techniques for Code Smells Detection: A Systematic Mapping Study
    Caram, Frederico Luiz
    De Oliveira Rodrigues, Bruno Rafael
    Campanelli, Amadeu Silveira
    Parreiras, Fernando Silva
    [J]. INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2019, 29 (02) : 285 - 316
  • [6] A Study on the Prediction of Characteristics of Molding Sand Using Machine Learning and Data Preprocessing Techniques
    Lee, Jeong-Min
    Kim, Moon-Jo
    Choe, Kyeong-Hwan
    Kim, DongEung
    [J]. KOREAN JOURNAL OF METALS AND MATERIALS, 2023, 61 (01) : 18 - 27
  • [7] Addressing imbalanced data for machine learning based mineral prospectivity mapping
    University of Turku, Department of Computing, 20014, Finland
    Unknown
    [J]. Ore Geol. Rev., 2024
  • [8] Effect of Data Preprocessing in the Detection of Epilepsy using Machine Learning Techniques
    Sabarivani, A.
    Ramadevi, R.
    Pandian, R.
    Krishnamoorthy, N. R.
    [J]. JOURNAL OF SCIENTIFIC & INDUSTRIAL RESEARCH, 2021, 80 (12) : 1066 - 1077
  • [9] Data preprocessing and feature selection techniques in gait recognition: A comparative study of machine learning and deep learning approaches
    Parashar, Anubha
    Parashar, Apoorva
    Ding, Weiping
    Shabaz, Mohammad
    Rida, Imad
    [J]. PATTERN RECOGNITION LETTERS, 2023, 172 : 65 - 73
  • [10] A Systematic Review on Imbalanced Data Challenges in Machine Learning: Applications and Solutions
    Kaur, Harsurinder
    Pannu, Husanbir Singh
    Malhi, Avleen Kaur
    [J]. ACM COMPUTING SURVEYS, 2019, 52 (04)