A Study on the Impact of Data Characteristics in Imbalanced Regression Tasks

Cited by: 5
Authors
Branco, Paula [1]
Torgo, Luis [1]
Affiliations
[1] Dalhousie Univ, Fac Comp Sci, Halifax, NS, Canada
Keywords
Imbalance regression; pre-processing strategies; data characteristics
DOI
10.1109/DSAA.2019.00034
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
The class imbalance problem has been thoroughly studied over the past two decades. More recently, the research community realized that the problem of imbalanced distributions also occurs in other tasks beyond classification. Regression problems are among these newly studied tasks where the problem of imbalanced domains also poses important challenges. Imbalanced regression problems occur in a diversity of real-world domains such as meteorology (predicting extreme weather values), finance (forecasting extreme stock returns) or medicine (anticipating rare values). In imbalanced regression, the end-user preferences are biased towards values of the target variable that are under-represented in the available data. Several pre-processing methods have been proposed to address this problem. These methods change the training set to force the learner to focus on the rare cases. However, as far as we know, the relationship between the intrinsic data characteristics and the performance achieved by these methods has not yet been studied for imbalanced regression tasks. In this paper we describe a study of the impact certain data characteristics may have on the results of applying pre-processing methods to imbalanced regression problems. To achieve this goal, we define potentially interesting data characteristics of regression problems. We then conduct our study using a synthetic data repository built for this purpose. We show that each of the characteristics studied behaves differently, and that this behaviour is related to the level at which the data characteristic is present and to the learning algorithm used.
The main contributions of our work are: i) to define interesting data characteristics for regression tasks; ii) to create the first repository of imbalanced regression tasks containing 6000 data sets with controlled data characteristics; and iii) to provide insights on the impact of intrinsic data characteristics on the results of pre-processing methods for handling imbalanced regression tasks.
Pages: 193-202
Page count: 10