X-DMM: Fast and Scalable Model Based Text Clustering

Authors
Li, Linwei [1]
Guo, Liangchen [1]
He, Zhenying [1, 2, 3]
Jing, Yinan [1, 2, 3]
Wang, X. Sean [1, 2, 3]
Affiliations
[1] Fudan Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[3] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China
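Funding
National Natural Science Foundation of China;
Keywords
ALGORITHMS;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract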
Text clustering is a widely studied problem in text mining. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional, sparse text data and achieve reasonable clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, which makes it inefficient to scale to long texts and large corpora, both common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to decrease the time complexity of sampling-based training for DMM from O(K * L) to O(K * U), where K is the number of clusters, L is the average document length, and U is the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm that further reduces the sampling time complexity from O(K * U) to O(U) in the near-convergence stages of training. Moreover, we parallelize DMM model training with an uncollapsed Gibbs sampler to obtain further acceleration. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale to long documents and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that X-DMM achieves a substantial speedup over existing state-of-the-art algorithms without degrading clustering accuracy.
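The step from O(K * L) to O(K * U) can be illustrated with a minimal sketch of the standard collapsed Gibbs conditional for DMM under symmetric priors. This is not the authors' implementation: the abstract does not describe their index structures, and the use of the log-gamma identity below is an assumption made here to show why grouping repeated words lets the per-cluster cost depend on unique words rather than tokens. The function names, argument names, and the toy counts are all hypothetical.

```python
import math
from collections import Counter


def log_weights_per_token(doc, m, n_kw, n_k, alpha, beta, vocab_size):
    """Unnormalized log p(z_d = k | rest), visiting every token: O(K * L).

    m[k]    : documents in cluster k (current document excluded)
    n_kw[k] : dict word -> count in cluster k (current document excluded)
    n_k[k]  : total tokens in cluster k (current document excluded)
    """
    weights = []
    for k in range(len(m)):
        lw = math.log(m[k] + alpha)
        seen = Counter()  # occurrences of each word already processed
        for i, w in enumerate(doc):
            lw += math.log(n_kw[k].get(w, 0) + beta + seen[w])
            lw -= math.log(n_k[k] + vocab_size * beta + i)
            seen[w] += 1
        weights.append(lw)
    return weights


def log_weights_per_unique_word(doc, m, n_kw, n_k, alpha, beta, vocab_size):
    """Same quantity, visiting each unique word once: O(K * U).

    Assumption of this sketch: the product over repeated occurrences of a
    word is collapsed with Gamma(x + c) / Gamma(x) = x (x + 1) ... (x + c - 1).
    """
    counts = Counter(doc)  # built once per document, shared by all clusters
    length = len(doc)
    weights = []
    for k in range(len(m)):
        lw = math.log(m[k] + alpha)
        for w, c in counts.items():
            base = n_kw[k].get(w, 0) + beta
            lw += math.lgamma(base + c) - math.lgamma(base)
        denom = n_k[k] + vocab_size * beta
        lw -= math.lgamma(denom + length) - math.lgamma(denom)
        weights.append(lw)
    return weights


# Toy example with two clusters (all counts are made up for illustration).
doc = ["a", "b", "a", "c", "a"]
m = [2, 1]
n_kw = [{"a": 3, "b": 1}, {"c": 2}]
n_k = [4, 2]
slow = log_weights_per_token(doc, m, n_kw, n_k, 0.1, 0.1, 3)
fast = log_weights_per_unique_word(doc, m, n_kw, n_k, 0.1, 0.1, 3)
```

Both functions return the same log-weights up to floating-point error; the second only touches the three unique words of the five-token document. The Metropolis-Hastings and parallel uncollapsed Gibbs steps mentioned in the abstract are not covered by this sketch.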
Pages: 4197-4204
Page count: 8