X-DMM: Fast and Scalable Model Based Text Clustering

Cited by: 0
Authors
Li, Linwei [1]
Guo, Liangchen [1]
He, Zhenying [1, 2, 3]
Jing, Yinan [1, 2, 3]
Wang, X. Sean [1, 2, 3]
Affiliations
[1] Fudan Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[3] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China
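Funding
National Natural Science Foundation of China
Keywords
ALGORITHMS
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract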
Text clustering is a widely studied problem in text mining. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional, sparse text data, achieving reasonable clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, which makes it inefficient to scale to long texts and large corpora, both common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to reduce the time complexity of sampling-based DMM training from O(K * L) to O(K * U), where K is the number of clusters, L is the average document length, and U is the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm that further reduces the sampling time complexity from O(K * U) to O(U) in the training stages close to convergence. We also parallelize DMM model training with an uncollapsed Gibbs sampler for further acceleration. We combine these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale to long documents and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that it achieves a substantial speedup over existing state-of-the-art algorithms without degrading clustering accuracy.
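To illustrate how the per-cluster sampling cost can depend on U rather than L, the following is a minimal sketch and not the authors' implementation. It assumes the standard collapsed Gibbs conditional for DMM with symmetric priors alpha and beta, and it replaces the per-token product with log-gamma differences over unique words. The abstract does not describe X-DMM's indices, so the function names, parameters, and toy counts here are all hypothetical.

```python
import math
from collections import Counter


def log_score_per_token(tokens, m_k, n_k, n_kw, alpha, beta, vocab_size):
    """Unnormalized log p(z_d = k | rest), one step per token: O(L)."""
    score = math.log(m_k + alpha)
    seen = Counter()  # occurrences of each word already processed in this document
    for i, w in enumerate(tokens):
        score += math.log(n_kw.get(w, 0) + beta + seen[w])
        score -= math.log(n_k + vocab_size * beta + i)
        seen[w] += 1
    return score


def log_score_per_unique_word(word_counts, doc_len, m_k, n_k, n_kw,
                              alpha, beta, vocab_size):
    """Same quantity, one step per unique word: O(U)."""
    score = math.log(m_k + alpha)
    for w, c in word_counts.items():
        base = n_kw.get(w, 0) + beta
        # prod_{j=0}^{c-1} (base + j) = Gamma(base + c) / Gamma(base)
        score += math.lgamma(base + c) - math.lgamma(base)
    total = n_k + vocab_size * beta
    score -= math.lgamma(total + doc_len) - math.lgamma(total)
    return score


# Hypothetical toy statistics for one cluster k (the document itself excluded).
tokens = ["data", "text", "data", "cluster", "data", "text"]
n_kw = {"data": 4, "text": 1, "model": 7}  # word counts in cluster k
n_k = 12                                   # total tokens in cluster k
m_k = 3                                    # documents in cluster k
alpha, beta, vocab_size = 0.1, 0.1, 1000

slow = log_score_per_token(tokens, m_k, n_k, n_kw, alpha, beta, vocab_size)
fast = log_score_per_unique_word(Counter(tokens), len(tokens), m_k, n_k, n_kw,
                                 alpha, beta, vocab_size)
print(slow, fast)
```

The word counts of a document do not change between iterations, so `Counter(tokens)` can be computed once per document; scoring all K clusters then costs O(K * U) per document instead of O(K * L).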
Pages: 4197-4204
Number of pages: 8