X-DMM: Fast and Scalable Model Based Text Clustering

Times Cited: 0
Authors
Li, Linwei [1 ]
Guo, Liangchen [1 ]
He, Zhenying [1 ,2 ,3 ]
Jing, Yinan [1 ,2 ,3 ]
Wang, X. Sean [1 ,2 ,3 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[2] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[3] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
ALGORITHMS;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text clustering is a widely studied problem in the text mining domain. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional, sparse text data, obtaining reasonable results in both clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, making it inefficient to scale up to long texts and large corpora, which are common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to decrease the time complexity of sampling-based DMM training from O(K * L) to O(K * U), where K is the number of clusters, L the average document length, and U the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm that further reduces the sampling time complexity from O(K * U) to O(U) in the near-convergence training stages. Moreover, we parallelize DMM model training to obtain a further acceleration by using an uncollapsed Gibbs sampler. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale up to long and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that it achieves a substantial speedup over existing state-of-the-art algorithms without degrading clustering accuracy.
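The abstract's complexity argument can be made concrete with a small illustration. The sketch below is a minimal Python sketch, not the authors' X-DMM code: it shows one collapsed Gibbs sampling step for a DMM-style model under a symmetric Dirichlet prior, where repeated occurrences of a word collapse into a ratio of Gamma functions, so each cluster's score is computed from the document's U unique words rather than all L tokens (the O(K * L) to O(K * U) reduction described above). All names (sample_cluster, doc_word_counts, n_kw, and so on) are illustrative assumptions, and the index structures, the Metropolis-Hastings step, and the parallel uncollapsed sampler are omitted.

import math
import random

def sample_cluster(doc_word_counts,  # dict: word id -> count in this document (U entries)
                   doc_len,          # total tokens in the document (L)
                   m,                # m[k]: number of documents currently in cluster k
                   n_kw,             # n_kw[k]: dict of word counts for cluster k
                   n_k,              # n_k[k]: total word count in cluster k
                   V, alpha, beta):
    """Draw a new cluster for one document whose own counts have already been
    removed from m, n_kw and n_k (standard collapsed-Gibbs bookkeeping)."""
    K = len(m)
    log_p = []
    for k in range(K):
        # prior: clusters that already hold many documents are more likely;
        # the (D - 1 + K * alpha) denominator is constant across k and cancels
        lp = math.log(m[k] + alpha)
        # likelihood: one Gamma ratio per *unique* word, so this inner loop
        # costs O(U) even when words repeat inside the document
        for w, c in doc_word_counts.items():
            nkw = n_kw[k].get(w, 0)
            lp += math.lgamma(nkw + beta + c) - math.lgamma(nkw + beta)
        lp -= math.lgamma(n_k[k] + V * beta + doc_len) - math.lgamma(n_k[k] + V * beta)
        log_p.append(lp)
    # normalise in log space and sample the new assignment
    mx = max(log_p)
    weights = [math.exp(x - mx) for x in log_p]
    r = random.random() * sum(weights)
    acc = 0.0
    for k, wgt in enumerate(weights):
        acc += wgt
        if r < acc:
            return k
    return K - 1

In the near-convergence stages most documents keep their current cluster, which is what makes a Metropolis-Hastings proposal attractive: instead of scoring all K clusters as in the loop above, a cheap proposal plus an accept/reject test brings the per-document cost down to roughly O(U), as the abstract states.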
Pages: 4197 - 4204
Number of Pages: 8
Related Papers
50 records in total
  • [21] A Model-based Approach for Text Clustering with Outlier Detection
    Yin, Jianhua
    Wang, Jianyong
    2016 32ND IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2016, : 625 - 636
  • [22] Research of feature selection for text clustering based on cloud model
    Zhao, Junmin
    Zhang, Kai
    Wan, Jian
    Journal of Software, 2013, 8 (12) : 3246 - 3252
  • [23] A Text Document Clustering Method Based on Weighted BERT Model
    Li, Yutong
    Cai, Juanjuan
    Wang, Jingling
    PROCEEDINGS OF 2020 IEEE 4TH INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC 2020), 2020, : 1426 - 1430
  • [24] Knowledge-based vector space model for text clustering
    Liping Jing
    Michael K. Ng
    Joshua Z. Huang
    Knowledge and Information Systems, 2010, 25 : 35 - 55
  • [25] A WordNet-based Semantic Model for Enhancing Text Clustering
    Shehata, Shady
    2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009), 2009, : 477 - 482
  • [26] A Text Mining Model Based on Improved Density Clustering Algorithm
    Chen Qi
    Lu Jianfeng
    Zhang Hao
    2013 IEEE 4TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC), 2014, : 337 - 339
  • [27] Transformer Fault Recognition Based on Kbert Text Clustering Model
    Jiang C.
    Wang Y.
    Chen M.
    Li C.
    Wang Y.
    Ma G.
    Gaodianya Jishu/High Voltage Engineering, 2022, 48 (08): 2991 - 3000
  • [28] Knowledge-based vector space model for text clustering
    Jing, Liping
    Ng, Michael K.
    Huang, Joshua Z.
    KNOWLEDGE AND INFORMATION SYSTEMS, 2010, 25 (01) : 35 - 55
  • [29] Fast model-based clustering of partial records
    Goren, Emily M.
    Maitra, Ranjan
    STAT, 2022, 11 (01):
  • [30] Scalable model-based cluster analysis using clustering features
    Jin, HD
    Leung, KS
    Wong, ML
    Xu, ZB
    PATTERN RECOGNITION, 2005, 38 (05) : 637 - 649