Undersampled K-means approach for handling imbalanced distributed data

Authors
Kumar, N. Santhosh [1]
Rao, K. Nageswara [2]
Govardhan, A. [3, 4]
Reddy, K. Sudheer [5]
Mahmood, Ali Mirza [6]
Affiliations
[1] JNTU, Dept CSE, Hyderabad, Andhra Pradesh, India
[2] PSCMR Coll Engn & Technol, Vijayawada, Andhra Pradesh, India
[3] CSE, Hyderabad, Andhra Pradesh, India
[4] JNTU, SIT, Hyderabad, Andhra Pradesh, India
[5] Infosys, Hyderabad, Andhra Pradesh, India
[6] DMS SVH Coll Engn, Machilipatam, Andhra Pradesh, India
Keywords
Imbalanced data; K-means clustering algorithms; Undersampling; USKM;
DOI
10.1007/s13748-014-0045-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
K-means is a well-known partitional clustering technique, widely used for its low computational cost. However, the performance of the K-means algorithm tends to degrade on skewed, i.e., imbalanced, data distributions. It often produces clusters of relatively uniform size even when the input data contain clusters of varied sizes, a behavior called the "uniform effect". In this paper, we analyze the causes of this effect and illustrate that it probably occurs more in the K-means clustering process. As the minority class decreases in size, the "uniform effect" becomes more evident. To counter the "uniform effect", we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced distributed data. The proposed algorithm consists of a novel undersampling technique that intelligently removes noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, using five algorithms for comparison on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.
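The abstract does not say how "noisy" and "weak" majority instances are identified, so the following is only a minimal sketch of the general idea (undersample the majority class, then run K-means), not the authors' USKM procedure. The centroid-distance rule, the `keep_fraction` parameter, and all function names are assumptions introduced here for illustration.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def undersample_majority(majority, minority, keep_fraction=0.5):
    """Hypothetical filter (an assumption, not the paper's rule).

    "Noisy": majority instances nearer to the minority centroid than to
    their own centroid are dropped.
    "Weak": of the rest, only the keep_fraction closest to the majority
    centroid are kept (never fewer than the minority size, if available).
    """
    c_maj, c_min = centroid(majority), centroid(minority)
    clean = [p for p in majority if math.dist(p, c_maj) <= math.dist(p, c_min)]
    clean.sort(key=lambda p: math.dist(p, c_maj))
    n_keep = max(len(minority), int(len(clean) * keep_fraction))
    return clean[:n_keep]


def assign(points, cents):
    """Index of the nearest centroid for every point."""
    return [min(range(len(cents)), key=lambda k: math.dist(p, cents[k]))
            for p in points]


def kmeans(points, init_centroids, n_iter=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    cents = list(init_centroids)
    for _ in range(n_iter):
        labels = assign(points, cents)
        new_cents = []
        for k in range(len(cents)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous centroid.
            new_cents.append(centroid(members) if members else cents[k])
        if new_cents == cents:  # converged
            break
        cents = new_cents
    return cents, assign(points, cents)


# Synthetic imbalanced data: 200 majority vs. 20 minority points.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(4, 0.5), rng.gauss(4, 0.5)) for _ in range(20)]

kept = undersample_majority(majority, minority, keep_fraction=0.5)
data = kept + minority
cents, labels = kmeans(data, init_centroids=[kept[0], minority[0]])
```

In this sketch the class labels are used only to decide which majority instances to discard; the clustering itself runs unsupervised on the reduced set, whose class sizes are less skewed than in the original data.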
Pages: 29-38
Number of pages: 10