Inference and evaluation of the multinomial mixture model for text clustering

Cited: 65
Authors
Rigouste, Lois
Cappe, Olivier
Yvon, Francois
Institutions
[1] GET Telecom Paris, F-75634 Paris 13, France
[2] CNRS, LTCI, F-75634 Paris 13, France
Keywords
multinomial mixture model; expectation-maximization; Gibbs sampling; text clustering;
DOI
10.1016/j.ipm.2006.11.001
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812 ;
Abstract
In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. Recent proposals have been made of probabilistic clustering models, which build "soft" theme-document associations. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between documents and clusters. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models however pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We empirically show that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, due to the specific profile of the word count distributions. Using the fact that the model parameters can be analytically integrated out, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach. (c) 2006 Elsevier Ltd. All rights reserved.
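The EM procedure the abstract takes as its baseline alternates between computing soft theme-document responsibilities (E-step) and re-estimating mixing weights and smoothed theme-word distributions (M-step). The sketch below is a minimal illustration of that scheme under common assumptions (random Dirichlet initialization, a fixed Laplace-style smoothing constant `alpha`); the function and variable names are ours, not the authors' implementation.

```python
# Minimal EM sketch for a multinomial mixture over word counts.
# Illustrative only: initialization and smoothing choices are assumptions,
# not the procedure evaluated in the paper.
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=1e-2, seed=0):
    """X: (n_docs, V) word-count matrix; K: number of themes."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing weights
    theta = rng.dirichlet(np.ones(V), size=K)   # theme-word probabilities
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability
        log_r = np.log(pi)[None, :] + X @ np.log(theta).T   # (n, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed word distributions
        pi = r.mean(axis=0)
        counts = r.T @ X + alpha                # Laplace-style smoothing
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r
```

The responsibility matrix `r` is exactly the "soft" theme-document association the abstract describes; its rows can be used directly as a K-dimensional "semantic" projection of each document.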
Pages: 1260 - 1280
Page count: 21
Related papers
50 in total
  • [41] Variational Bayes estimation of hierarchical Dirichlet-multinomial mixtures for text clustering
    Massimo Bilancia
    Michele Di Nanni
    Fabio Manca
    Gianvito Pio
    Computational Statistics, 2023, 38 : 2015 - 2051
  • [42] Fuzzy Co-clustering Induced by q-Multinomial Mixture Models
    Kanzawa, Yuchi
    2017 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ-IEEE), 2017,
  • [43] Improved Inference of Gaussian Mixture Copula Model for Clustering and Reproducibility Analysis using Automatic Differentiation
    Kasa, Siva Rajesh
    Rajan, Vaibhav
    ECONOMETRICS AND STATISTICS, 2022, 22 : 67 - 97
  • [44] A mixture model for pose clustering
    Moss, S
    Wilson, RC
    Hancock, ER
    PATTERN RECOGNITION LETTERS, 1999, 20 (11-13) : 1093 - 1101
  • [45] Mixture model modal clustering
    Chacon, Jose E.
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2019, 13 (02) : 379 - 404
  • [46] Mixture model averaging for clustering
    Wei, Yuhong
    McNicholas, Paul D.
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2015, 9 (02) : 197 - 217
  • [47] A mixture model for clustering ensembles
    Topchy, A
    Jain, AK
    Punch, W
    PROCEEDINGS OF THE FOURTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2004, : 379 - 390
  • [50] Maximising entropy on the nonparametric predictive inference model for multinomial data
    Abellan, Joaquin
    Baker, Rebecca M.
    Coolen, Frank P. A.
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2011, 212 (01) : 112 - 122