X-DMM: Fast and Scalable Model Based Text Clustering

Times Cited: 0
Authors
Li, Linwei [1 ]
Guo, Liangchen [1 ]
He, Zhenying [1 ,2 ,3 ]
Jing, Yinan [1 ,2 ,3 ]
Wang, X. Sean [1 ,2 ,3 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[3] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
ALGORITHMS;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text clustering is a widely studied problem in the text mining domain. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional, sparse text data, obtaining reasonable results in both clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, making it inefficient to scale up to long texts and large corpora, which are common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to decrease the time complexity of sampling-based DMM training from O(K * L) to O(K * U), where K is the number of clusters, L the average document length, and U the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm that further reduces the sampling time complexity from O(K * U) to O(U) in the near-convergence training stages. Moreover, we parallelize DMM model training to obtain a further acceleration by using an uncollapsed Gibbs sampler. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale up to long and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that it achieves a substantial speedup over existing state-of-the-art algorithms without degrading clustering accuracy.
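The abstract's complexity argument can be made concrete with a small illustration. The sketch below is a minimal Python sketch, not the authors' X-DMM code: it shows one collapsed Gibbs sampling step for a DMM-style model under a symmetric Dirichlet prior, where repeated occurrences of a word collapse into a ratio of Gamma functions, so each cluster's score is computed from the document's U unique words rather than all L tokens (the O(K * L) to O(K * U) reduction described above). All names (sample_cluster, doc_word_counts, n_kw, and so on) are illustrative assumptions, and the index structures, the Metropolis-Hastings step, and the parallel uncollapsed sampler are omitted.

import math
import random

def sample_cluster(doc_word_counts,  # dict: word id -> count in this document (U entries)
                   doc_len,          # total tokens in the document (L)
                   m,                # m[k]: number of documents currently in cluster k
                   n_kw,             # n_kw[k]: dict of word counts for cluster k
                   n_k,              # n_k[k]: total word count in cluster k
                   V, alpha, beta):
    """Draw a new cluster for one document whose own counts have already been
    removed from m, n_kw and n_k (standard collapsed-Gibbs bookkeeping)."""
    K = len(m)
    log_p = []
    for k in range(K):
        # prior: clusters that already hold many documents are more likely;
        # the (D - 1 + K * alpha) denominator is constant across k and cancels
        lp = math.log(m[k] + alpha)
        # likelihood: one Gamma ratio per *unique* word, so this inner loop
        # costs O(U) even when words repeat inside the document
        for w, c in doc_word_counts.items():
            nkw = n_kw[k].get(w, 0)
            lp += math.lgamma(nkw + beta + c) - math.lgamma(nkw + beta)
        lp -= math.lgamma(n_k[k] + V * beta + doc_len) - math.lgamma(n_k[k] + V * beta)
        log_p.append(lp)
    # normalise in log space and sample the new assignment
    mx = max(log_p)
    weights = [math.exp(x - mx) for x in log_p]
    r = random.random() * sum(weights)
    acc = 0.0
    for k, wgt in enumerate(weights):
        acc += wgt
        if r < acc:
            return k
    return K - 1

In the near-convergence stages most documents keep their current cluster, which is what makes a Metropolis-Hastings proposal attractive: instead of scoring all K clusters as in the loop above, a cheap proposal plus an accept/reject test brings the per-document cost down to roughly O(U), as the abstract states.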
Pages: 4197 - 4204
Number of Pages: 8
Related Papers
50 records in total
  • [21] A Model-based Approach for Text Clustering with Outlier Detection
    Yin, Jianhua
    Wang, Jianyong
    2016 32ND IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2016, : 625 - 636
  • [22] Research of feature selection for text clustering based on cloud model
    Zhao, Junmin
    Zhang, Kai
    Wan, Jian
    Journal of Software, 2013, 8 (12) : 3246 - 3252
  • [23] A Text Document Clustering Method Based on Weighted BERT Model
    Li, Yutong
    Cai, Juanjuan
    Wang, Jingling
    PROCEEDINGS OF 2020 IEEE 4TH INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC 2020), 2020, : 1426 - 1430
  • [24] Knowledge-based vector space model for text clustering
    Liping Jing
    Michael K. Ng
    Joshua Z. Huang
    Knowledge and Information Systems, 2010, 25 : 35 - 55
  • [25] A WordNet-based Semantic Model for Enhancing Text Clustering
    Shehata, Shady
    2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009), 2009, : 477 - 482
  • [26] A Text Mining Model Based on Improved Density Clustering Algorithm
    Chen Qi
    Lu Jianfeng
    Zhang Hao
    2013 IEEE 4TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC), 2014, : 337 - 339
  • [27] Transformer Fault Recognition Based on Kbert Text Clustering Model
    Jiang C.
    Wang Y.
    Chen M.
    Li C.
    Wang Y.
    Ma G.
    Gaodianya Jishu/High Voltage Engineering, 2022, 48 (08): 2991 - 3000
  • [28] Knowledge-based vector space model for text clustering
    Jing, Liping
    Ng, Michael K.
    Huang, Joshua Z.
    KNOWLEDGE AND INFORMATION SYSTEMS, 2010, 25 (01) : 35 - 55
  • [29] Fast model-based clustering of partial records
    Goren, Emily M.
    Maitra, Ranjan
    STAT, 2022, 11 (01):
  • [30] Scalable model-based cluster analysis using clustering features
    Jin, HD
    Leung, KS
    Wong, ML
    Xu, ZB
    PATTERN RECOGNITION, 2005, 38 (05) : 637 - 649