Undersampled K-means approach for handling imbalanced distributed data

Cited by: 0
Authors
Kumar, N. Santhosh [1]
Rao, K. Nageswara [2]
Govardhan, A. [3, 4]
Reddy, K. Sudheer [5]
Mahmood, Ali Mirza [6]
Affiliations
[1] JNTU, Dept CSE, Hyderabad, Andhra Pradesh, India
[2] PSCMR Coll Engn & Technol, Vijayawada, Andhra Pradesh, India
[3] CSE, Hyderabad, Andhra Pradesh, India
[4] JNTU, SIT, Hyderabad, Andhra Pradesh, India
[5] Infosys, Hyderabad, Andhra Pradesh, India
[6] DMS SVH Coll Engn, Machilipatnam, Andhra Pradesh, India
Keywords
Imbalanced data; K-means clustering algorithms; Undersampling; USKM
DOI
10.1007/s13748-014-0045-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
K-means is a well-known partitional clustering technique, widely used for its low computational cost. However, the performance of the K-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the input data have clusters of varied sizes, a behavior called the "uniform effect". In this paper, we analyze the causes of this effect and illustrate that it is more likely to occur in the K-means clustering process. As the minority class decreases in size, the uniform effect becomes more evident. To counter it, we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced distributed data. The proposed algorithm consists of a novel undersampling technique that intelligently removes noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, with five algorithms used for comparison on eight evaluation metrics. The experimental results show the effectiveness of the proposed clustering algorithm on both balanced and imbalanced data.
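The general idea in the abstract, undersampling the majority class before running K-means, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the USKM algorithm itself: the abstract does not say how "noisy and weak" instances are identified, so the sketch simply drops the majority points farthest from the majority centroid. The names `undersample_majority`, `kmeans`, and `keep_ratio` are hypothetical, and the data are synthetic. It uses only the Python standard library (Python 3.8+ for `math.dist`).

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def undersample_majority(majority, keep_ratio):
    """Keep the fraction of majority points closest to the majority centroid.

    Assumption: points far from the centroid stand in for "noisy and weak"
    instances. The paper's actual removal criterion is not given in the
    abstract, so this rule is purely illustrative.
    """
    c = centroid(majority)
    n_keep = max(1, round(len(majority) * keep_ratio))
    return sorted(majority, key=lambda p: math.dist(p, c))[:n_keep]


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means with random initial centroids drawn from the data."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append(centroid(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Synthetic imbalanced data: a wide majority blob and a small, tight minority blob.
rng = random.Random(42)
majority = [(rng.gauss(0, 3), rng.gauss(0, 3)) for _ in range(300)]
minority = [(rng.gauss(12, 0.5), rng.gauss(0, 0.5)) for _ in range(15)]

# Shrink the majority class first, then cluster the rebalanced data.
reduced = undersample_majority(majority, keep_ratio=0.2)
centroids, labels = kmeans(reduced + minority, k=2)
```

Because the majority class shrinks from 300 to 60 points before clustering, the two groups are closer in size, which is the situation in which K-means' tendency toward uniform cluster sizes does the least harm. The value of `keep_ratio` is arbitrary here; a real method would need a principled way to choose which instances to remove and how many.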
Pages: 29-38
Page count: 10