X-DMM: Fast and Scalable Model Based Text Clustering

Cited by: 0

Authors:

Li, Linwei ^{[1]}

Guo, Liangchen ^{[1]}

He, Zhenying ^{[1,2,3]}

Jing, Yinan ^{[1,2,3]}

Wang, X. Sean ^{[1,2,3]}

Affiliations:

[1] Fudan Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China

[2] Shanghai Key Lab Data Sci, Shanghai, Peoples R China

[3] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China

Source:

THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2019

Funding:

National Natural Science Foundation of China;

Keywords:

ALGORITHMS;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Text clustering is a widely studied problem in text mining. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional, sparse text data, achieving reasonable clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, so it scales poorly to long texts and large corpora, both of which are common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to decrease the time complexity of sampling-based training for DMM from O(K * L) to O(K * U), where K is the number of clusters, L is the average document length, and U is the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm, which further reduces the sampling time complexity from O(K * U) to O(U) in the training stages close to convergence. We also parallelize DMM model training with an uncollapsed Gibbs sampler to obtain further acceleration. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale to long documents and large-scale text clustering. We evaluate X-DMM on several real-world datasets, and the experimental results show that it achieves a substantial speedup over existing state-of-the-art algorithms without degrading clustering accuracy.
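
The reduction from O(K * L) to O(K * U) described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard collapsed Gibbs conditional for DMM with symmetric priors alpha and beta, and it replaces the paper's index structures with a log-gamma identity as a stand-in. All function and variable names (`log_weights_naive`, `log_weights_unique`, `m`, `n_kw`, `n_k`) are hypothetical. The normalizing constant shared by all clusters is omitted, so the values are unnormalized log-weights.

```python
import math
from collections import Counter


def log_weights_naive(tokens, m, n_kw, n_k, alpha, beta, V):
    """Unnormalized log p(z_d = k | rest), one step per token: O(K * L).

    m[k]    : number of documents in cluster k (current document excluded)
    n_kw[k] : dict mapping word -> count in cluster k (current document excluded)
    n_k[k]  : total number of tokens in cluster k (current document excluded)
    V       : vocabulary size
    """
    out = []
    for k in range(len(m)):
        lw = math.log(m[k] + alpha)
        seen = Counter()  # occurrences of each word already processed
        for i, w in enumerate(tokens):
            lw += math.log(n_kw[k].get(w, 0) + beta + seen[w])
            lw -= math.log(n_k[k] + V * beta + i)
            seen[w] += 1
        out.append(lw)
    return out


def log_weights_unique(tokens, m, n_kw, n_k, alpha, beta, V):
    """Same quantity, one step per unique word: O(K * U).

    Uses prod_{j=0}^{c-1} (x + j) = Gamma(x + c) / Gamma(x), which is an
    assumption of this sketch rather than the indexing scheme of the paper.
    """
    counts = Counter(tokens)  # unique words with their frequencies
    length = len(tokens)
    out = []
    for k in range(len(m)):
        lw = math.log(m[k] + alpha)
        for w, c in counts.items():
            base = n_kw[k].get(w, 0) + beta
            lw += math.lgamma(base + c) - math.lgamma(base)
        total = n_k[k] + V * beta
        lw -= math.lgamma(total + length) - math.lgamma(total)
        out.append(lw)
    return out


if __name__ == "__main__":
    # Toy counts, made up for illustration only.
    doc = ["a", "b", "a", "c", "a"]
    args = ([2, 1], [{"a": 3, "b": 1}, {"c": 2}], [4, 2], 0.1, 0.1, 5)
    print(log_weights_naive(doc, *args))
    print(log_weights_unique(doc, *args))
```

Both functions return the same log-weights up to floating-point error. The second one touches each distinct word once per cluster, so its cost depends on U rather than L, which matters most for long documents with many repeated words.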


Pages: 4197-4204

Page count: 8

50 items in total

[1] Scalable k-NN based text clustering
Lulli, Alessandro
Debatty, Thibault
Dell'Amico, Matteo
Michiardi, Pietro
Ricci, Laura
PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2015, : 958 - 963
[2] Scalable, balanced model-based clustering
Zhong, S
Ghosh, J
PROCEEDINGS OF THE THIRD SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2003, : 71 - 82
[3] Scalable text semantic clustering around topics
Brena, Ramon
Ramirez, Eduardo
JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2019, 36 (05) : 4645 - 4657
[4] Fast and Scalable Protein Motif Sequence Clustering based on Hadoop Framework
Farhangi, Erfan
Ghadiri, Nasser
Asadi, Mahsa
Nikbakht, Mohammad Amin
Pitre, Sylvain
2017 3RD INTERNATIONAL CONFERENCE ON WEB RESEARCH (ICWR), 2017, : 24 - 31
[5] A generic query-based model for scalable clustering
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
NII Tech. Rep., 2006, 8 (19-21):
[6] Improved fast partitional clustering algorithm for text clustering
Bejos, Sebastian
Feliciano-Avelino, Ivan
Martinez-Trinidad, J. Fco.
Carrasco-Ochoa, J. A.
JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 39 (02) : 2137 - 2145
[7] An Ant-based Fast Text Clustering Approach Using Pheromone
Zhang, Fuzhi
Ma, Yujing
Hou, Na
Liu, Hui
FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 385 - 389
[8] Summarization of Text Clustering based Vector Space Model
Chen, Mingzhen
Song, Yu
2009 IEEE 10TH INTERNATIONAL CONFERENCE ON COMPUTER-AIDED INDUSTRIAL DESIGN & CONCEPTUAL DESIGN, VOLS 1-3: E-BUSINESS, CREATIVE DESIGN, MANUFACTURING - CAID&CD'2009, 2009, : 2362 - 2365
[9] A Wikipedia-based Semantic Model for Text Clustering
Zhou, Jing-min
Cui, Qing-jun
Zhang, Hui
2011 INTERNATIONAL CONFERENCE ON FUTURE COMPUTER SCIENCE AND APPLICATION (FCSA 2011), VOL 2, 2011, : 413 - 416
[10] The research on text clustering based on LDA joint model
Li, Chen
Yang, Cheng
Jiang, Qin
JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2017, 32 (05) : 3655 - 3667
