A Study on the Impact of Data Characteristics in Imbalanced Regression Tasks

Cited by: 5
Authors
Branco, Paula [1]
Torgo, Luis [1]
Affiliations
[1] Dalhousie Univ, Fac Comp Sci, Halifax, NS, Canada
Keywords
Imbalance regression; pre-processing strategies; data characteristics
DOI
10.1109/DSAA.2019.00034
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The class imbalance problem has been thoroughly studied over the past two decades. More recently, the research community realized that the problem of imbalanced distributions also occurs in tasks beyond classification. Regression problems are among these newly studied tasks, where imbalanced domains also pose important challenges. Imbalanced regression problems occur in a diversity of real-world domains such as meteorology (predicting extreme weather values), finance (forecasting extreme stock returns) or medicine (anticipating rare values). In imbalanced regression, the end-user preferences are biased towards values of the target variable that are under-represented in the available data. Several pre-processing methods have been proposed to address this problem. These methods change the training set to force the learner to focus on the rare cases. However, as far as we know, the relationship between the intrinsic characteristics of the data and the performance achieved by these methods has not yet been studied for imbalanced regression tasks. In this paper we describe a study of the impact that certain data characteristics may have on the results of applying pre-processing methods to imbalanced regression problems. To achieve this goal, we define potentially interesting data characteristics of regression problems. We then conduct our study using a synthetic data repository built for this purpose. We show that each of the characteristics studied exhibits a different behaviour, which is related to the level at which the data characteristic is present and to the learning algorithm used.
The main contributions of our work are: i) to define interesting data characteristics for regression tasks; ii) to create the first repository of imbalanced regression tasks containing 6000 data sets with controlled data characteristics; and iii) to provide insights on the impact of intrinsic data characteristics on the results of pre-processing methods for handling imbalanced regression tasks.
Pages: 193 - 202
Number of pages: 10
Related papers
50 in total
  • [1] The Impact of Local Data Characteristics on Learning from Imbalanced Data
    Stefanowski, Jerzy
    ROUGH SETS AND INTELLIGENT SYSTEMS PARADIGMS, RSEISP 2014, 2014, 8537 : 1 - 13
  • [2] Oversampling techniques for imbalanced data in regression
    Belhaouari, Samir Brahim
    Islam, Ashhadul
    Kassoul, Khelil
    Al-Fuqaha, Ala
    Bouzerdoum, Abdesselam
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 252
  • [3] Data Complexity Measures for Imbalanced Classification Tasks
    Barella, Victor H.
    Garcia, Luis P. F.
    de Souto, Marcilio P.
    Lorena, Ana C.
    de Carvalho, Andre
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [4] Dice Loss for Data-imbalanced NLP Tasks
    Li, Xiaoya
    Sun, Xiaofei
    Meng, Yuxian
    Liang, Junjun
    Wu, Fei
    Li, Jiwei
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 465 - 476
  • [5] Research on Imbalanced Data Regression Based on Confrontation
    Liu, Xiaowen
    Tian, Huixin
    PROCESSES, 2024, 12 (02)
  • [6] IRDA: Implicit data augmentation for deep imbalanced regression
    Zhu, Weiyao
    Wu, Ou
    Yang, Nan
    INFORMATION SCIENCES, 2024, 677
  • [7] On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
    Mussmann, Stephen
    Jia, Robin
    Liang, Percy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020,
  • [8] Multi-output regression for imbalanced data stream
    Peng, Tao
    Sellami, Sana
    Boucelma, Omar
    Chbeir, Richard
    EXPERT SYSTEMS, 2023, 40 (10)
  • [9] Chebyshev approaches for imbalanced data streams regression models
    Aminian, Ehsan
    Ribeiro, Rita P.
    Gama, João
    DATA MINING AND KNOWLEDGE DISCOVERY, 2021, 35 (06) : 2389 - 2466