A Distance-Based Weighted Undersampling Scheme for Support Vector Machines and its Application to Imbalanced Classification

Cited by: 109
Authors
Kang, Qi [1]
Shi, Lei [1]
Zhou, MengChu [2, 3]
Wang, XueSong [1]
Wu, Qidi [1]
Wei, Zhi [4]
Affiliations
[1] Tongji Univ, Sch Elect & Informat Engn, Dept Control Sci & Engn, Shanghai 201804, Peoples R China
[2] Macau Univ Sci & Technol, Inst Syst Engn, Macau 999078, Peoples R China
[3] New Jersey Inst Technol, Dept Elect & Comp Engn, Newark, NJ 07102 USA
[4] New Jersey Inst Technol, Dept Comp Sci, Newark, NJ 07102 USA
Keywords
Class imbalance; data distribution; Euclidean distance; support vector machine (SVM); undersampling; ensemble
DOI
10.1109/TNNLS.2017.2755595
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A support vector machine (SVM) plays a prominent role in classic machine learning, especially in classification and regression. Owing to its structural risk minimization, it has a good reputation for reducing overfitting, avoiding the curse of dimensionality, and not falling into local minima. Nevertheless, existing SVMs do not perform well when facing class imbalance and large-scale samples. Undersampling is a plausible way to address imbalanced problems, but it suffers from soaring computational complexity and reduced accuracy because of its numerous iterations and random sampling process. To improve SVM classification performance on imbalanced data, this work proposes a weighted undersampling (WU) scheme for SVM based on space geometry distance, yielding an improved algorithm named WU-SVM. In WU-SVM, majority samples are grouped into subregions (SRs) and assigned different weights according to their Euclidean distance to the hyperplane. Samples in an SR with a higher weight are more likely to be sampled and used in each learning iteration, so as to retain as much of the data distribution information of the original data sets as possible. Comprehensive experiments test WU-SVM on 21 binary-class and six multiclass publicly available data sets. The results show that it outperforms state-of-the-art methods on three popular metrics for imbalanced classification: area under the curve, F-Measure, and G-Mean.
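The weighting idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' WU-SVM algorithm: the abstract does not say how subregions are formed or how weights depend on distance, so the equal-size split by distance rank, the 1/(k+1) weight for the k-th closest subregion, and the names `distance_to_hyperplane` and `weighted_undersample` are all assumptions made here for illustration. The hyperplane (w, b) is assumed to be given, for example by a previously trained linear SVM.

```python
import math
import random


def distance_to_hyperplane(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def weighted_undersample(majority, w, b, n_regions, n_keep, seed=0):
    """Return sorted indices of n_keep majority samples.

    Assumptions (not taken from the paper): subregions are equal-size
    groups by distance rank, and the k-th closest subregion gets
    weight 1 / (k + 1), so samples near the hyperplane are favoured.
    """
    if not majority:
        return []
    rng = random.Random(seed)
    # Rank majority samples by their distance to the hyperplane.
    order = sorted(range(len(majority)),
                   key=lambda i: distance_to_hyperplane(majority[i], w, b))
    # Split the ranked indices into contiguous subregions.
    size = math.ceil(len(order) / n_regions)
    regions = [order[k:k + size] for k in range(0, len(order), size)]
    # Every sample inherits the weight of its subregion.
    candidates, weights = [], []
    for k, region in enumerate(regions):
        for i in region:
            candidates.append(i)
            weights.append(1.0 / (k + 1))
    # Weighted sampling without replacement.
    chosen = []
    for _ in range(min(n_keep, len(candidates))):
        j = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(j))
        weights.pop(j)
    return sorted(chosen)


# Toy majority class in 2-D; the hyperplane x1 = 0 is a made-up example.
majority = [(0.1, 0.0), (0.5, 0.2), (2.0, 1.0),
            (3.0, 3.0), (4.0, 0.5), (5.0, 5.0)]
kept = weighted_undersample(majority, w=(1.0, 0.0), b=0.0,
                            n_regions=3, n_keep=3)
print(kept)
```

In an iterative scheme of the kind the abstract describes, the retained majority samples would be combined with the minority class to retrain the classifier, and the distances would be recomputed against the updated hyperplane. That loop is omitted here because the abstract gives no details of it.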
Pages: 4152-4165
Number of pages: 14