X-DMM: Fast and Scalable Model Based Text Clustering

Authors
Li, Linwei [1]
Guo, Liangchen [1]
He, Zhenying [1,2,3]
Jing, Yinan [1,2,3]
Wang, X. Sean [1,2,3]
Affiliations
[1] Fudan Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[3] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
ALGORITHMS
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text clustering is a widely studied problem in text mining. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model perform well on high-dimensional, sparse text data, achieving reasonable clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, which makes it inefficient to scale to long texts and large corpora, both of which are common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to reduce the time complexity of sampling-based DMM training from O(K * L) to O(K * U), where K is the number of clusters, L is the average document length, and U is the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm, which further reduces the sampling time complexity from O(K * U) to O(U) in the training stages close to convergence. Moreover, we parallelize DMM model training with an uncollapsed Gibbs sampler to obtain a further speedup. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale to long documents and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that X-DMM achieves a substantial speedup over existing state-of-the-art algorithms without degrading clustering accuracy.
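The O(K * L) to O(K * U) reduction can be illustrated with a minimal sketch. It is not the X-DMM implementation: the abstract gives no formulas, and the indices it mentions are not reproduced here. The sketch assumes the commonly used collapsed Gibbs conditional for DMM with symmetric priors `alpha` and `beta`, and all function and variable names (`m_k`, `n_k`, `n_kw`, `V`) are hypothetical. For one cluster, the first function walks every token (L steps), while the second touches each unique word once (U steps) by replacing the per-occurrence product with a log-gamma ratio.

```python
import math
from collections import Counter

# Assumed (unnormalized) conditional for assigning a document to cluster k:
#   (m_k + alpha) * prod_w prod_{j<c_w} (n_kw + beta + j)
#                 / prod_{i<L} (n_k + V*beta + i)
# m_k: documents in cluster k, n_k: tokens in cluster k,
# n_kw: per-word token counts in cluster k, V: vocabulary size.

def log_score_per_token(doc, m_k, n_k, n_kw, alpha, beta, V):
    """O(L) for one cluster: one step per token occurrence."""
    score = math.log(m_k + alpha)
    seen = Counter()  # occurrences of each word consumed so far
    for i, w in enumerate(doc):
        score += math.log(n_kw.get(w, 0) + beta + seen[w])
        score -= math.log(n_k + V * beta + i)
        seen[w] += 1
    return score

def log_score_per_unique_word(doc_counts, doc_len, m_k, n_k, n_kw,
                              alpha, beta, V):
    """O(U) for one cluster: one log-gamma ratio per unique word."""
    score = math.log(m_k + alpha)
    for w, c in doc_counts.items():
        base = n_kw.get(w, 0) + beta
        # prod_{j=0}^{c-1} (base + j) = Gamma(base + c) / Gamma(base)
        score += math.lgamma(base + c) - math.lgamma(base)
    total = n_k + V * beta
    score -= math.lgamma(total + doc_len) - math.lgamma(total)
    return score
```

Both functions return the same log score up to floating-point error, so the second can replace the first when the document is stored as word counts. Repeating the computation for each of the K clusters gives O(K * U) per document instead of O(K * L).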
Pages: 4197-4204
Number of pages: 8