Imbalanced classification in sparse and large behaviour datasets

Cited by: 0

Authors:

Jellis Vanhoeyveld

David Martens

Affiliation:

[1] Department of Engineering Management,

Source:

Data Mining and Knowledge Discovery | 2018 / Volume 32

Keywords:

Imbalanced learning; Behaviour data; Over- and undersampling; Cost-sensitive learning; Support vector machine (SVM); On-line repository

DOI:

Not available

Abstract:

Recent years have witnessed a growing number of publications dealing with the imbalanced learning issue. While a plethora of techniques have been investigated on traditional low-dimensional data, little is known about their effect on behaviour data. This kind of data reflects fine-grained behaviours of individuals or organisations and is characterized by sparseness and very large dimensions. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show good overall performance and do not seem to suffer from the overfitting that traditional studies report. A variety of undersampling approaches are investigated as well and show the performance-degrading effect of instances showing odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method outperforming all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier reduction effects and simultaneously explores the majority class space efficiently. Furthermore, the method is very fast since it is parallelizable and each subset is only twice as large as the minority class size.
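
The EasyEnsemble procedure summarized in the abstract (sample several balanced subsets, boost on each, combine the hypotheses) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper boosts SVM base learners, whereas this sketch swaps in single-feature decision stumps so that it runs with the Python standard library only. The function names (`adaboost`, `easy_ensemble`, `predict`), the toy data, and the representation of each instance as a set of active feature indices are all assumptions made for illustration.

```python
import math
import random


def stump_predict(stump, x):
    """Stump (feature, polarity): vote `polarity` if the feature is active in x."""
    feature, polarity = stump
    return polarity if feature in x else -polarity


def adaboost(X, y, n_rounds):
    """Plain AdaBoost over decision stumps; returns a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n
    features = sorted(set().union(*X))
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for f in features:
            for pol in (1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump_predict((f, pol), xi) != yi)
                if best is None or err < best[0]:
                    best = (err, (f, pol))
        err, stump = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Re-weight: misclassified instances gain weight, then normalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def easy_ensemble(X, y, n_subsets=5, n_rounds=5, seed=0):
    """Boost on several balanced subsets and pool all weak learners."""
    rng = random.Random(seed)
    minority = [i for i, yi in enumerate(y) if yi == 1]
    majority = [i for i, yi in enumerate(y) if yi == -1]
    model = []
    for _ in range(n_subsets):
        # Each subset holds all minority instances plus an equally sized
        # random majority sample, so it is twice the minority class size.
        idx = minority + rng.sample(majority, len(minority))
        model.extend(adaboost([X[i] for i in idx], [y[i] for i in idx], n_rounds))
    return model


def predict(model, x):
    """Sign of the summed, alpha-weighted votes of all pooled weak learners."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score > 0 else -1


# Toy sparse data (an assumption): feature 0 marks the minority class,
# features 1..49 are random noise.
data_rng = random.Random(42)
X, y = [], []
for _ in range(10):  # minority class, label +1
    X.append({0} | {data_rng.randrange(1, 50) for _ in range(3)})
    y.append(1)
for _ in range(200):  # majority class, label -1
    X.append({data_rng.randrange(1, 50) for _ in range(3)})
    y.append(-1)

model = easy_ensemble(X, y)
print(predict(model, {0, 5}), predict(model, {7}))
```

Because the subsets are trained independently of one another, the loop in `easy_ensemble` could be distributed across workers, which is the parallelism the abstract refers to.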

Pages: 25-82

Number of pages: 57
