An Earth mover's distance-based undersampling approach for handling class-imbalanced data

Authors
Rekha G. [1]
Krishna Reddy V. [2]
Tyagi A.K. [3]
Affiliations
[1] Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Hyderabad, Telangana
[2] Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur
[3] School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus, Chennai, Tamil Nadu
Keywords
Class imbalance; Classification; Data pre-processing; Earth mover's distance; EMD; Sampling technique
DOI
10.1504/IJIIDS.2020.109463
Abstract
Imbalanced datasets typically make accurate prediction difficult, and most real-world data are imbalanced in nature. Traditional classifiers assume a well-balanced class distribution in the training data, but practical datasets are often imbalanced, which misleads a classifier and degrades its ability to learn from them. Data pre-processing approaches address this concern by using either random undersampling or oversampling techniques. In this paper, we introduce Earth mover's distance (EMD) as a similarity measure to find samples that are similar in nature and eliminate them from the dataset as redundant. EMD has received considerable attention in areas such as computer vision, image retrieval and machine learning. The EMD-based undersampling approach provides a data-level solution that eliminates redundant instances among the majority samples without any loss of valuable information. The method is applied with five conventional classifiers, namely the C4.5 decision tree (DT), k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM) and naive Bayes (NB), and with one ensemble technique, AdaBoost. The proposed method yields superior performance on 21 datasets from the KEEL repository. © 2020 Inderscience Enterprises Ltd.
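The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea of removing near-duplicate majority samples by EMD. It rests on several assumptions that are not taken from the paper: each sample's feature vector is treated as a one-dimensional empirical distribution with uniform weights, redundancy is decided by a hypothetical `threshold` parameter, and samples are filtered greedily in input order. The function names `emd_1d` and `emd_undersample` are illustrative.

```python
from typing import List, Sequence


def emd_1d(u: Sequence[float], v: Sequence[float]) -> float:
    """EMD between two equal-length samples with uniform weights.

    In one dimension this reduces to the mean absolute difference
    between the sorted values of the two samples.
    """
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal-length vectors")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def emd_undersample(majority: List[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of majority samples to keep.

    Assumption (not from the paper): a sample is redundant when its EMD
    to any already-kept sample is at most `threshold`.
    """
    kept: List[int] = []
    for i, x in enumerate(majority):
        if all(emd_1d(x, majority[j]) > threshold for j in kept):
            kept.append(i)
    return kept


# Toy majority class: samples 1 and 3 are (near-)duplicates of sample 0.
majority = [
    [0.10, 0.20, 0.30],
    [0.11, 0.21, 0.29],
    [0.90, 0.80, 0.70],
    [0.10, 0.20, 0.30],
]
kept_idx = emd_undersample(majority, threshold=0.05)
print(kept_idx)
```

Because this formulation sorts the feature values, it ignores which feature each value belongs to, so two vectors that are permutations of each other have distance zero. A faithful implementation would need the distribution construction and ground distance defined in the paper itself.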
Pages: 376-392
Number of pages: 16