Inference and evaluation of the multinomial mixture model for text clustering

Cited: 65
Authors
Rigouste, Lois
Cappe, Olivier
Yvon, Francois
Institutions
[1] GET Telecom Paris, F-75634 Paris 13, France
[2] CNRS, LTCI, F-75634 Paris 13, France
Keywords
multinomial mixture model; expectation-maximization; Gibbs sampling; text clustering;
DOI
10.1016/j.ipm.2006.11.001
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812 ;
Abstract
In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. Recent proposals have been made of probabilistic clustering models, which build "soft" theme-document associations. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between documents and clusters. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models however pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We empirically show that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, due to the specific profile of the word count distributions. Using the fact that the model parameters can be analytically integrated out, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach. (c) 2006 Elsevier Ltd. All rights reserved.
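The EM procedure the abstract takes as its baseline alternates between computing soft theme-document responsibilities (E-step) and re-estimating mixing weights and smoothed theme-word distributions (M-step). The sketch below is a minimal illustration of that scheme under common assumptions (random Dirichlet initialization, a fixed Laplace-style smoothing constant `alpha`); the function and variable names are ours, not the authors' implementation.

```python
# Minimal EM sketch for a multinomial mixture over word counts.
# Illustrative only: initialization and smoothing choices are assumptions,
# not the procedure evaluated in the paper.
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=1e-2, seed=0):
    """X: (n_docs, V) word-count matrix; K: number of themes."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing weights
    theta = rng.dirichlet(np.ones(V), size=K)   # theme-word probabilities
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability
        log_r = np.log(pi)[None, :] + X @ np.log(theta).T   # (n, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed word distributions
        pi = r.mean(axis=0)
        counts = r.T @ X + alpha                # Laplace-style smoothing
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r
```

The responsibility matrix `r` is exactly the "soft" theme-document association the abstract describes; its rows can be used directly as a K-dimensional "semantic" projection of each document.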
Pages: 1260 - 1280
Page count: 21
Related papers
50 in total
  • [41] Variational Bayes estimation of hierarchical Dirichlet-multinomial mixtures for text clustering
    Massimo Bilancia
    Michele Di Nanni
    Fabio Manca
    Gianvito Pio
    Computational Statistics, 2023, 38 : 2015 - 2051
  • [42] Fuzzy Co-clustering Induced by q-Multinomial Mixture Models
    Kanzawa, Yuchi
    2017 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ-IEEE), 2017,
  • [43] Improved Inference of Gaussian Mixture Copula Model for Clustering and Reproducibility Analysis using Automatic Differentiation
    Kasa, Siva Rajesh
    Rajan, Vaibhav
    ECONOMETRICS AND STATISTICS, 2022, 22 : 67 - 97
  • [44] A mixture model for pose clustering
    Moss, S
    Wilson, RC
    Hancock, ER
    PATTERN RECOGNITION LETTERS, 1999, 20 (11-13) : 1093 - 1101
  • [45] Mixture model modal clustering
    Chacon, Jose E.
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2019, 13 (02) : 379 - 404
  • [46] Mixture model averaging for clustering
    Wei, Yuhong
    McNicholas, Paul D.
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2015, 9 (02) : 197 - 217
  • [47] A mixture model for clustering ensembles
    Topchy, A
    Jain, AK
    Punch, W
    PROCEEDINGS OF THE FOURTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2004, : 379 - 390
  • [50] Maximising entropy on the nonparametric predictive inference model for multinomial data
    Abellan, Joaquin
    Baker, Rebecca M.
    Coolen, Frank P. A.
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2011, 212 (01) : 112 - 122