An Improved K-means Algorithm for Document Clustering

Cited by: 8
Authors
Wu, Guohua [1 ]
Lin, Hairong [1 ]
Fu, Ershuai [1 ]
Wang, Liuyang [1 ]
Affiliation
[1] Hang Zhou Dian Zi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China
Keywords
K-Means; SimHash; Text clustering;
DOI
10.1109/CSMA.2015.20
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
The K-Means algorithm performs poorly on high-dimensional, sparse data, because traditional distance measures cannot handle such data effectively. Motivated by this, this paper proposes a K-Means algorithm based on SimHash. After the text is preprocessed, SimHash is applied to the extracted feature vectors to obtain a fingerprint for each text. SimHash not only reduces the dimensionality of the text representation, but also allows the Hamming distance between fingerprints to be used directly as the vector distance. The Hamming distance then determines which cluster each text belongs to. Experimental results show that the algorithm maintains clustering quality while greatly reducing the running time of K-Means clustering.
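The pipeline described in the abstract (fingerprint each text with SimHash, then cluster by Hamming distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the 64-bit fingerprint width, MD5 as the per-token hash, term-frequency weights, the naive initialization, and the bitwise majority-vote center update are all assumptions chosen for illustration, since the abstract does not specify them. The function names are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint width; the abstract does not state one


def simhash(tokens, bits=BITS):
    """Compute a SimHash fingerprint from a list of tokens.

    Each token is weighted by its term frequency (an assumption).
    """
    weights = [0] * bits
    for token, freq in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the token hash has a 1 bit, subtract otherwise.
            weights[i] += freq if (h >> i) & 1 else -freq
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def assign(fingerprints, centers):
    """Label each fingerprint with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: hamming(fp, centers[k]))
        for fp in fingerprints
    ]


def update_center(members, bits=BITS):
    """Bitwise majority vote over a cluster (assumed update rule)."""
    center = 0
    for i in range(bits):
        ones = sum((m >> i) & 1 for m in members)
        if 2 * ones > len(members):
            center |= 1 << i
    return center


def kmeans_hamming(fingerprints, k, iterations=10):
    """K-Means-style loop over fingerprints using Hamming distance."""
    centers = list(fingerprints[:k])  # naive initialization (assumption)
    labels = assign(fingerprints, centers)
    for _ in range(iterations):
        for c in range(k):
            members = [fp for fp, lab in zip(fingerprints, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = update_center(members)
        new_labels = assign(fingerprints, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers
```

Under these assumptions, each distance computation is one XOR plus a bit count on a 64-bit integer, instead of a floating-point operation over a high-dimensional sparse vector, which is consistent with the speed-up the abstract reports.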
Pages: 65-69
Page count: 5