Undersampled K-means approach for handling imbalanced distributed data

Cited by: 0
Authors
Kumar, N. Santhosh [1]
Rao, K. Nageswara [2]
Govardhan, A. [3, 4]
Reddy, K. Sudheer [5]
Mahmood, Ali Mirza [6]
Affiliations
[1] JNTU, Dept CSE, Hyderabad, Andhra Pradesh, India
[2] PSCMR Coll Engn & Technol, Vijayawada, Andhra Pradesh, India
[3] CSE, Hyderabad, Andhra Pradesh, India
[4] JNTU, SIT, Hyderabad, Andhra Pradesh, India
[5] Infosys, Hyderabad, Andhra Pradesh, India
[6] DMS SVH Coll Engn, Machilipatnam, Andhra Pradesh, India
Keywords
Imbalanced data; K-means clustering algorithms; Undersampling; USKM
DOI
10.1007/s13748-014-0045-6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
K-means is a well-known partitional clustering technique, widely used for its low computational cost. However, the performance of the K-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the true clusters in the input data vary in size, a behavior called the "uniform effect". In this paper, we analyze the causes of this effect and illustrate that it is likely to occur in the K-means clustering process. As the minority class decreases in size, the uniform effect becomes more evident. To counter the uniform effect, we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced distributed data. The proposed algorithm consists of a novel undersampling technique that removes noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, comparing against five algorithms on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm on both balanced and imbalanced data.
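The general idea of undersampling the majority class before running K-means can be sketched as follows. This is a minimal illustration only, not the authors' USKM algorithm: the abstract does not specify how "noisy and weak" instances are identified, so the sketch substitutes a simple distance-to-centroid filter as an assumption. The function names `kmeans` and `undersample_majority`, the `keep_fraction` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment against the final centroids.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1), centroids

def undersample_majority(X, y, majority_label, keep_fraction=0.5):
    """Assumed stand-in filter (not the paper's criterion): keep only the
    majority instances closest to the majority-class centroid."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    center = X[maj_idx].mean(axis=0)
    dist = np.linalg.norm(X[maj_idx] - center, axis=1)
    n_keep = max(1, int(round(keep_fraction * len(maj_idx))))
    kept_maj = maj_idx[np.argsort(dist)[:n_keep]]
    keep = np.sort(np.concatenate([kept_maj, min_idx]))
    return X[keep], y[keep]

# Synthetic imbalanced data: 950 majority points, 50 minority points.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(950, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(50, 2)),
])
y = np.array([0] * 950 + [1] * 50)

# Undersample the majority class, then cluster the reduced data.
X_bal, y_bal = undersample_majority(X, y, majority_label=0, keep_fraction=0.2)
labels, centroids = kmeans(X_bal, k=2)
print("class counts after undersampling:", np.bincount(y_bal))
print("cluster sizes:", np.bincount(labels, minlength=2))
```

The sketch uses class labels to decide which instances to drop, which matches a setting where labeled benchmark data (such as the UCI datasets mentioned in the abstract) is available. Comparing the cluster sizes with and without the undersampling step is one way to observe how class imbalance affects the K-means partition.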
Pages: 29-38
Number of pages: 10