A Study on the Impact of Data Characteristics in Imbalanced Regression Tasks

Cited by: 5

Authors:

Branco, Paula ^{[1]}

Torgo, Luis ^{[1]}

Affiliations:

[1] Dalhousie Univ, Fac Comp Sci, Halifax, NS, Canada

Source:

2019 IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA 2019) | 2019

Keywords:

Imbalance regression; pre-processing strategies; data characteristics;

DOI:

10.1109/DSAA.2019.00034

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

The class imbalance problem has been thoroughly studied over the past two decades. More recently, the research community realized that the problem of imbalanced distributions also occurs in tasks beyond classification. Regression problems are among these newly studied tasks in which imbalanced domains also pose important challenges. Imbalanced regression problems occur in a diversity of real-world domains such as meteorology (predicting extreme weather values), finance (forecasting extreme stock returns) or medicine (anticipating rare values). In imbalanced regression, the end-user preferences are biased towards values of the target variable that are under-represented in the available data. Several pre-processing methods have been proposed to address this problem. These methods change the training set to force the learner to focus on the rare cases. However, as far as we know, the relationship between the intrinsic characteristics of the data and the performance achieved by these methods has not yet been studied for imbalanced regression tasks. In this paper we describe a study of the impact certain data characteristics may have on the results of applying pre-processing methods to imbalanced regression problems. To achieve this goal, we define potentially interesting data characteristics of regression problems. We then conduct our study using a synthetic data repository built for this purpose. We show that the characteristics studied each exhibit a different behaviour, which is related to the level at which the data characteristic is present and to the learning algorithm used.
The main contributions of our work are: i) to define interesting data characteristics for regression tasks; ii) to create the first repository of imbalanced regression tasks containing 6000 data sets with controlled data characteristics; and iii) to provide insights on the impact of intrinsic data characteristics on the results of pre-processing methods for handling imbalanced regression tasks.

Citation

Pages: 193 - 202

Page count: 10

50 items in total

[41] Confusion-Matrix-Based Kernel Logistic Regression for Imbalanced Data Classification
Ohsaki, Miho
Wang, Peng
Matsuda, Kenji
Katagiri, Shigeru
Watanabe, Hideyuki
Ralescu, Anca
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (09) : 1806 - 1819
[42] An Effective Ensemble Method for Multi-class Classification and Regression for Imbalanced Data
Alam, Tahira
Ahmed, Chowdhury Farhan
Zahin, Sabit Anwar
Khan, Muhammad Asif Hossain
Islam, Maliha Tashfia
ADVANCES IN DATA MINING: APPLICATIONS AND THEORETICAL ASPECTS (ICDM 2018), 2018, 10933 : 59 - 74
[44] The impact of data difficulty factors on classification of imbalanced and concept drifting data streams
Brzezinski, Dariusz
Minku, Leandro L.
Pewinski, Tomasz
Stefanowski, Jerzy
Szumaczuk, Artur
KNOWLEDGE AND INFORMATION SYSTEMS, 2021, 63 (06) : 1429 - 1469
[45] Impact of Hyperparameter Tuning in Classifying Highly Imbalanced Big Data
Hancock, John
Khoshgoftaar, Taghi M.
2021 IEEE 22ND INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE (IRI 2021), 2021 : 348 - 354
[46] Mining impact-targeted activity patterns in imbalanced data
Cao, Longbing
Zhao, Yanchang
Zhang, Chengqi
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2008, 20 (08) : 1053 - 1066
[47] The Impact of Imbalanced Training Data on Local Matching Learning of Ontologies
Laadhar, Amir
Ghozzi, Faiza
Megdiche, Imen
Ravat, Franck
Teste, Olivier
Gargouri, Faiez
BUSINESS INFORMATION SYSTEMS, PT I, 2019, 353 : 162 - 175
[48] Impact of Data and Study Characteristics on Microbiome Volatility Estimates
Park, Daniel J. J.
Plantinga, Anna M. M.
GENES, 2023, 14 (01)
[49] Active Learning for Imbalanced Ordinal Regression
Ge, Jiaming
Chen, Haiyan
Zhang, Dongfang
Hou, Xiaye
Yuan, Ligang
IEEE ACCESS, 2020, 8 (08) : 180608 - 180617
[50] ADDRESSING IMBALANCED INSURANCE DATA THROUGH ZERO-INFLATED POISSON REGRESSION WITH BOOSTING
Lee, Simon C. K.
ASTIN BULLETIN, 2021, 51 (01) : 27 - 55
