Undersampled K-means approach for handling imbalanced distributed data

Cited by: 0
Authors
Kumar, N. Santhosh [1]
Rao, K. Nageswara [2]
Govardhan, A. [3, 4]
Reddy, K. Sudheer [5]
Mahmood, Ali Mirza [6]
Affiliations
[1] JNTU, Dept CSE, Hyderabad, Andhra Pradesh, India
[2] PSCMR Coll Engn & Technol, Vijayawada, Andhra Pradesh, India
[3] CSE, Hyderabad, Andhra Pradesh, India
[4] JNTU, SIT, Hyderabad, Andhra Pradesh, India
[5] Infosys, Hyderabad, Andhra Pradesh, India
[6] DMS SVH Coll Engn, Machilipatam, Andhra Pradesh, India
Keywords
Imbalanced data; K-means clustering algorithms; Undersampling; USKM;
DOI
10.1007/s13748-014-0045-6
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
K-means is a well-known partitional clustering technique that is widely used for its low computational cost. However, the performance of the K-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the input data contain clusters of varied sizes, a behavior called the "uniform effect". In this paper, we analyze the causes of this effect and illustrate that it is more likely to occur in the K-means clustering process. As the minority class decreases in size, the "uniform effect" becomes more evident. To counter the "uniform effect", we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced distributed data. The proposed algorithm consists of a novel undersampling technique implemented by intelligently removing noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, using five algorithms for comparison across eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.
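
The general idea of undersampling the majority class before running K-means can be sketched as follows. This is only an illustration under stated assumptions, not the USKM algorithm itself: the abstract does not specify how "noisy and weak" instances are identified, so the sketch substitutes a simple stand-in criterion (distance from the majority-class centroid). The function names `undersample_majority` and `kmeans`, the `keep_ratio` parameter, and the synthetic data are all hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def undersample_majority(majority, keep_ratio=0.5):
    """Keep the fraction of majority points closest to their own centroid.

    Assumption: distance from the class centroid stands in for the
    paper's criterion for "noisy and weak" instances, which the
    abstract does not describe.
    """
    c = centroid(majority)
    ranked = sorted(majority, key=lambda p: math.dist(p, c))
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: recompute each center; keep the old one if a cluster is empty.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Synthetic, imbalanced 2-D data (hypothetical): 200 majority vs. 20 minority points.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(6, 0.5), rng.gauss(0, 0.5)) for _ in range(20)]

reduced = undersample_majority(majority, keep_ratio=0.25)
centers, labels = kmeans(reduced + minority, k=2)
print(len(majority), "majority points reduced to", len(reduced))
```

Reducing the majority class narrows the size gap between the two groups before clustering; whether this mitigates the "uniform effect" on real data depends on the selection criterion, which is where the paper's own method would differ from this stand-in.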
Pages: 29-38
Page count: 10
Related papers
50 in total
  • [21] Distributed Algorithm for Text Documents Clustering Based on k-Means Approach
    Sarnovsky, Martin
    Carnoka, Noema
    INFORMATION SYSTEMS ARCHITECTURE AND TECHNOLOGY, ISAT 2015, PT II, 2016, 430 : 165 - 174
  • [22] Distributed threshold k-means clustering for privacy preserving data mining
    Baby, Vadlana
    Chandra, N. Subhash
    2016 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2016, : 2286 - 2289
  • [23] Warped K-Means: An algorithm to cluster sequentially-distributed data
    Leiva, Luis A.
    Vidal, Enrique
    INFORMATION SCIENCES, 2013, 237 : 196 - 210
  • [24] A GPS Data based Distributed K-means for Cabstand Location Selection
    He, Tianjia
    Gui, Wei
    Zhang, Bo
    Lu, Ke
    2017 INTERNATIONAL SMART CITIES CONFERENCE (ISC2), 2017,
  • [25] Integration of distributed biological data using modified K-means algorithm
    Jeong, Jongil
    Ryu, Byunggul
    Shin, Dongil
    Shin, Dongkyoo
    EMERGING TECHNOLOGIES IN KNOWLEDGE DISCOVERY AND DATA MINING, 2007, 4819 : 469 - +
  • [26] Automatic Determination of K in Distributed K-Means Clustering
    Kotary, Dinesh Kumar
    Nanda, Satyasai Jagannath
    2ND INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ADVANCED COMPUTING ICRTAC - DISRUPTIVE INNOVATION, 2019, 2019, 165 : 556 - 564
  • [27] Soil data clustering by using K-means and fuzzy K-means algorithm
    Hot, Elma
    Popovic-Bugarin, Vesna
    2015 23RD TELECOMMUNICATIONS FORUM TELFOR (TELFOR), 2015, : 890 - 893
  • [28] Adapting K-Means Algorithm for Pair-Wise Constrained Clustering of Imbalanced Data Streams
    Wojciechowski, Szymon
    Gonzalez-Almagro, German
    Garcia, Salvador
    Wozniak, Michal
    HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, HAIS 2022, 2022, 13469 : 153 - 163
  • [29] A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data
    Xu, Zhaozhao
    Shen, Derong
    Nie, Tiezheng
    Kou, Yue
    Yin, Nan
    Han, Xi
    INFORMATION SCIENCES, 2021, 572 : 574 - 589
  • [30] Classifying Imbalanced Data using an SVM Ensemble with k-means Clustering in Semiconductor Test Process
    Park, Eun-mi
    Lee, Jee-hyong
    SIXTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2013), 2013, 9067