An Improved K-means Algorithm for Document Clustering

Cited by: 8
Authors
Wu, Guohua [1]
Lin, Hairong [1]
Fu, Ershuai [1]
Wang, Liuyang [1]
Affiliations
[1] Hangzhou Dianzi University, School of Computer Science and Technology, Hangzhou, Zhejiang, People's Republic of China
Keywords
K-Means; SimHash; Text clustering;
DOI
10.1109/CSMA.2015.20
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The K-Means algorithm handles high-dimensional, sparse data poorly, because traditional distance measures are not effective on such data. Motivated by this, this paper proposes a K-Means algorithm based on SimHash. After the text is preprocessed, SimHash is applied to the extracted feature vectors to obtain a fingerprint for each text. SimHash not only reduces the dimensionality of the text representation but also allows the Hamming distance between fingerprints to be used directly as the vector distance. The Hamming distance then determines which cluster each text belongs to. Experimental results show that the algorithm preserves clustering quality while greatly improving the speed of K-means clustering.
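The pipeline described in the abstract (fingerprint each text with SimHash, then cluster by Hamming distance) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the hash function, the feature weights, the fingerprint length, the initialization, or how cluster centers are updated. Here MD5 truncated to 64 bits, term-frequency weights, first-k initialization, and a bitwise majority vote for the center are all hypothetical choices, as are the function names.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint length (assumption; not given in the abstract)


def simhash(tokens, bits=BITS):
    """Return a `bits`-bit SimHash fingerprint for a list of tokens."""
    weights = Counter(tokens)  # term frequency as feature weight (assumption)
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Add the weight where the hash bit is 1, subtract it where it is 0.
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def majority_center(fingerprints, bits=BITS):
    """Bitwise majority vote over the members of one cluster (assumption)."""
    center = 0
    for i in range(bits):
        ones = sum((fp >> i) & 1 for fp in fingerprints)
        if 2 * ones > len(fingerprints):
            center |= 1 << i
    return center


def kmeans_hamming(fingerprints, k, max_iter=20):
    """K-means-style clustering of fingerprints using Hamming distance."""
    centers = list(fingerprints[:k])  # naive initialization (assumption)
    labels = None
    for _ in range(max_iter):
        # Assign each fingerprint to the nearest center.
        new_labels = [
            min(range(k), key=lambda c: hamming(fp, centers[c]))
            for fp in fingerprints
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each non-empty cluster's center.
        for c in range(k):
            members = [fp for fp, lab in zip(fingerprints, labels) if lab == c]
            if members:
                centers[c] = majority_center(members)
    return labels, centers


if __name__ == "__main__":
    docs = [
        "k-means clustering of text documents",
        "stock market prices fell sharply today",
        "document clustering with the k-means algorithm",
        "share prices on the stock market dropped",
    ]
    fps = [simhash(doc.lower().split()) for doc in docs]
    labels, _ = kmeans_hamming(fps, k=2)
    print(labels)
```

Because every text is reduced to a single fixed-length integer, each distance computation is an XOR plus a bit count, independent of vocabulary size, which is consistent with the speed-up the abstract reports. Whether short texts like those in the usage example separate cleanly depends on the fingerprint length and the weighting scheme.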
Pages: 65-69
Page count: 5