A Semi-Supervised Text Clustering Approach Based on K-Means Algorithm

Authors
Zhan, Lizhang [1]
Xu, Hong [1]
Chen, Xiuguo
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
Keywords
text clustering; k-means clustering algorithm; feature weight learning; vector space model
DOI
Not available
Chinese Library Classification number
F [Economics]
Discipline classification code
02
Abstract
This paper proposes a semi-supervised text clustering approach based on the standard k-means algorithm. First, the approach transforms textual data into algebraic vectors in a multidimensional space using the VSM (Vector Space Model), and transforms the prior knowledge about the sample set into a set of constraints, represented as a Boolean matrix. Second, it uses the constraints to supervise feature weight learning by gradient descent and uses the learned feature weight matrix to optimize the Euclidean distance metric, so that the metric both reflects the prior constraint relationships in the text set and uncovers latent semantic information between texts. Third, it applies the k-means clustering algorithm to cluster the texts. Experimental evidence shows that the approach obtains substantial improvements in text clustering compared with the standard k-means clustering algorithm.
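The three steps in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the loss function, so the pairwise loss below (must-link pairs pay their weighted squared distance, cannot-link pairs pay a hinge penalty until they are at least `margin` apart) is an assumption, as are the names `vectorize`, `learn_weights`, `kmeans`, the toy documents, the diagonal form of the weight matrix, and the reading of the Boolean matrix as "True means same cluster" over a labeled subset.

```python
from itertools import combinations

def vectorize(docs):
    """VSM step: map each document to a term-count vector over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d.split()})
    return vocab, [[d.split().count(t) for t in vocab] for d in docs]

def sq_dist(x, y, w):
    """Feature-weighted squared Euclidean distance: sum_k w_k * (x_k - y_k)^2."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y))

def learn_weights(X, labeled, same, margin=4.0, lr=0.05, steps=200):
    """Gradient descent on an ASSUMED loss (the abstract does not specify one):
    must-link pairs pay d_w^2, cannot-link pairs pay max(0, margin - d_w^2)."""
    w = [1.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for a, b in combinations(range(len(labeled)), 2):
            x, y = X[labeled[a]], X[labeled[b]]
            if same[a][b]:
                sign = 1.0       # must-link: shrink weights of differing terms
            elif sq_dist(x, y, w) < margin:
                sign = -1.0      # cannot-link inside the margin: grow them
            else:
                continue         # cannot-link pair already far enough apart
            grad = [g + sign * (p - q) ** 2 for g, p, q in zip(grad, x, y)]
        w = [max(0.0, wk - lr * g) for wk, g in zip(w, grad)]  # keep weights >= 0
    return w

def kmeans(X, w, k, iters=20):
    """Standard k-means iterations using the weighted distance."""
    centers = [[float(v) for v in X[0]]]
    while len(centers) < k:  # deterministic farthest-point seeding (a sketch choice)
        far = max(X, key=lambda x: min(sq_dist(x, c, w) for c in centers))
        centers.append([float(v) for v in far])
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(x, centers[j], w)) for x in X]
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # leave a center unchanged if its cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy corpus: documents 0-3 carry prior knowledge, 4-5 do not.
docs = ["goal match news", "goal team news", "vote party news",
        "vote poll news", "goal league", "vote debate"]
vocab, X = vectorize(docs)
labeled = [0, 1, 2, 3]
same = [[True, True, False, False],   # Boolean constraint matrix over `labeled`:
        [True, True, False, False],   # True = same cluster, False = different
        [False, False, True, True],
        [False, False, True, True]]
weights = learn_weights(X, labeled, same)
labels = kmeans(X, weights, k=2)
print(dict(zip(vocab, weights)))
print(labels)
```

In this sketch, terms that differ only within must-link pairs lose weight, while terms that separate cannot-link pairs gain weight, so the unlabeled documents are assigned under a metric shaped by the constraints. A full feature weight matrix, as the abstract describes, would generalize the per-term weights used here.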
Pages: 2616-2620
Page count: 5