Imbalanced classification in sparse and large behaviour datasets

Authors
Jellis Vanhoeyveld
David Martens
Affiliation
[1] Department of Engineering Management,
Source
Data Mining and Knowledge Discovery, 2018, 32 (01)
Keywords
Imbalanced learning; Behaviour data; Over- and undersampling; Cost-sensitive learning; Support vector machine (SVM); On-line repository
DOI
Not available
Abstract
Recent years have witnessed a growing number of publications dealing with the imbalanced learning issue. While a plethora of techniques have been investigated on traditional low-dimensional data, little is known about their effect on behaviour data. This kind of data reflects fine-grained behaviours of individuals or organisations and is characterized by sparseness and very high dimensionality. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show a good overall performance and do not seem to suffer from overfitting as traditional studies report. A variety of undersampling approaches are investigated as well and show the performance-degrading effect of instances showing odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method outperforming all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier reduction effects and simultaneously explores the majority class space efficiently. Furthermore, the method is very fast since it is parallelizable and each subset is only twice as large as the minority class size.
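The EasyEnsemble procedure summarised in the abstract (sample several balanced subsets, boost on each, combine the hypotheses) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each instance is a set of active feature indices (a stand-in for sparse binary behaviour data), that labels are +1 for the minority class and -1 for the majority class, and that one-feature decision stumps replace the SVM base learners mentioned in the abstract. All function names and the toy data are hypothetical.

```python
import math
import random


def stump_output(x, feature, sign):
    """Return sign if the feature is active in x, otherwise -sign."""
    return sign if feature in x else -sign


def train_stump(X, y, w, n_features):
    """Return (weighted_error, feature, sign) of the best one-feature rule."""
    best = None
    for j in range(n_features):
        # Weighted error of the rule "+1 if feature j is active, else -1".
        err = sum(wi for xi, yi, wi in zip(X, y, w)
                  if stump_output(xi, j, 1) != yi)
        # Flipping the sign flips every prediction, so its error is 1 - err.
        for sign, e in ((1, err), (-1, 1.0 - err)):
            if best is None or e < best[0]:
                best = (e, j, sign)
    return best


def adaboost(X, y, n_features, rounds):
    """Discrete AdaBoost; returns a list of (alpha, feature, sign) stumps."""
    w = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        err, j, sign = train_stump(X, y, w, n_features)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, j, sign))
        # Increase the weight of misclassified instances, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_output(xi, j, sign))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def easy_ensemble(X, y, n_features, n_subsets=5, rounds=5, seed=0):
    """Boost on several balanced subsets and pool all weighted stumps."""
    rng = random.Random(seed)
    minority = [i for i, yi in enumerate(y) if yi == 1]
    majority = [i for i, yi in enumerate(y) if yi == -1]
    ensemble = []
    for _ in range(n_subsets):
        # Each subset holds all minority instances plus an equally sized
        # random majority sample, so it is twice the minority class size.
        idx = minority + rng.sample(majority, len(minority))
        ensemble.extend(adaboost([X[i] for i in idx], [y[i] for i in idx],
                                 n_features, rounds))
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote over every stump from every subset."""
    score = sum(a * stump_output(x, j, s) for a, j, s in ensemble)
    return 1 if score > 0 else -1


# Toy data (hypothetical): feature 0 is active only for minority instances.
X = [{0, 1}, {0, 2}, {0, 3}, {0, 1, 2}] + [{1 + k % 3} for k in range(40)]
y = [1] * 4 + [-1] * 40
ensemble = easy_ensemble(X, y, n_features=4)
print(predict(ensemble, {0, 3}), predict(ensemble, {1, 2}))
```

Each pass of the loop in `easy_ensemble` depends only on its own subset, so the passes could run in parallel; the sketch runs them sequentially for brevity.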
Pages: 25-82
Number of pages: 57
Related papers
50 in total
  • [1] Imbalanced classification in sparse and large behaviour datasets
    Vanhoeyveld, Jellis
    Martens, David
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2018, 32 (01) : 25 - 82
  • [2] Sparse Matrix Classification on Imbalanced Datasets Using Convolutional Neural Networks
    Pichel, Juan C.
    Pateiro-Lopez, Beatriz
    [J]. IEEE ACCESS, 2019, 7 : 82377 - 82389
  • [3] To improve classification of imbalanced datasets
    Shukla, Pratyusha
    Bhowmick, Kiran
    [J]. 2017 INTERNATIONAL CONFERENCE ON INNOVATIONS IN INFORMATION, EMBEDDED AND COMMUNICATION SYSTEMS (ICIIECS), 2017,
  • [4] Classification of Antimicrobial Peptides with Imbalanced Datasets
    Camacho, Francy L.
    Torres, Rodrigo
    Ramos Pollan, Raul
    [J]. 11TH INTERNATIONAL SYMPOSIUM ON MEDICAL INFORMATION PROCESSING AND ANALYSIS, 2015, 9681
  • [5] Discrimination Aware Classification for Imbalanced Datasets
    Ristanoski, Goce
    Liu, Wei
    Bailey, James
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 1529 - 1532
  • [6] A robust loss function for classification with imbalanced datasets
    Wang, Yidan
    Yang, Liming
    [J]. NEUROCOMPUTING, 2019, 331 : 40 - 49
  • [7] FLSOM with Different Rates for Classification in Imbalanced Datasets
    Machon-Gonzalez, Ivan
    Lopez-Garcia, Hilario
    [J]. ARTIFICIAL NEURAL NETWORKS - ICANN 2008, PT I, 2008, 5163 : 642 - 651
  • [8] Categorical classifiers in multiclass classification with imbalanced datasets
    Carpita, Maurizio
    Golia, Silvia
    [J]. STATISTICAL ANALYSIS AND DATA MINING, 2023, 16 (04) : 391 - 405
  • [9] Convolutional Rebalancing Network for the Classification of Large Imbalanced Rice Pest and Disease Datasets in the Field
    Yang, Guofeng
    Chen, Guipeng
    Li, Cong
    Fu, Jiangfan
    Guo, Yang
    Liang, Hua
    [J]. FRONTIERS IN PLANT SCIENCE, 2021, 12
  • [10] Active Sample Selection Through Sparse Neighborhood for Imbalanced Datasets
    Gu, Ping
    Ling, Zhao
    Shao, Si Yu
    Zhou, Meng
    [J]. 2019 IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (ISCC), 2019, : 112 - 117