Inference and evaluation of the multinomial mixture model for text clustering

Cited by: 65
Authors
Rigouste, Loïs
Cappé, Olivier
Yvon, François
Affiliations
[1] GET Telecom Paris, F-75634 Paris 13, France
[2] CNRS, LTCI, F-75634 Paris 13, France
Keywords
multinomial mixture model; expectation-maximization; Gibbs sampling; text clustering;
DOI
10.1016/j.ipm.2006.11.001
CLC number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. Probabilistic clustering models, which build "soft" theme-document associations, have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between documents and clusters. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We empirically show that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, owing to the specific profile of the word count distributions. Using the fact that the model parameters can be analytically integrated out, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach. (c) 2006 Elsevier Ltd. All rights reserved.
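The EM procedure described in the abstract can be sketched as follows. This is a minimal illustrative implementation of EM for a mixture of multinomials over word counts, not the authors' exact code: the function name, the random Dirichlet initialization, and the Laplace pseudo-count `alpha` are assumptions (the paper specifically studies how initialization and the smoothing strategy affect the outcome).

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=0.1, seed=0):
    """EM for a mixture of K multinomial distributions over word counts.

    X: (D, V) array of per-document word counts.
    alpha: Laplace smoothing pseudo-count (illustrative value; the paper
    examines the influence of the smoothing strategy).
    Returns mixture weights pi (K,), theme-word probabilities theta (K, V),
    and soft theme-document responsibilities resp (D, K).
    """
    rng = np.random.default_rng(seed)
    D, V = X.shape
    # Random soft initialization; the paper stresses that EM in this
    # high-dimensional parameter space is sensitive to initialization.
    resp = rng.dirichlet(np.ones(K), size=D)                 # (D, K)
    for _ in range(n_iter):
        # M-step: mixture weights and smoothed theme-word probabilities.
        pi = resp.sum(axis=0) / D                            # (K,)
        counts = resp.T @ X                                  # (K, V)
        theta = (counts + alpha) / (
            counts.sum(axis=1, keepdims=True) + V * alpha)   # rows sum to 1
        # E-step: log p(x_d, z_d = k) = log pi_k + sum_w x_dw log theta_kw,
        # normalized per document in a numerically stable way.
        log_post = np.log(pi) + X @ np.log(theta).T          # (D, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return pi, theta, resp
```

Each row of `resp` is the "soft" theme-document association vector the abstract refers to; it can be used directly as a low-dimensional representation of the document. The collapsed Gibbs sampler the paper favors would instead integrate `pi` and `theta` out analytically and resample only the theme assignments.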
Pages: 1260-1280
Page count: 21
Related papers
50 records in total
  • [31] Short text clustering based on Pitman-Yor process mixture model
    Qiang, Jipeng
    Li, Yun
    Yuan, Yunhao
    Wu, Xindong
    APPLIED INTELLIGENCE, 2018, 48 (07) : 1802 - 1812
  • [32] Model based clustering of multinomial count data
    Papastamoulis, Panagiotis
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2023,
  • [33] Finite mixture-of-gamma distributions: estimation, inference, and model-based clustering
    Young, Derek S.
    Chen, Xi
    Hewage, Dilrukshi C.
    Nilo-Poyanco, Ricardo
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2019, 13 (04) : 1053 - 1082
  • [35] ALTERNATIVE COMPUTATIONAL APPROACHES TO INFERENCE IN THE MULTINOMIAL PROBIT MODEL
    GEWEKE, J
    KEANE, M
    RUNKLE, D
    REVIEW OF ECONOMICS AND STATISTICS, 1994, 76 (04) : 609 - 632
  • [36] A New Effective Neural Variational Model with Mixture-of-Gaussians Prior for Text Clustering
    Li, Miao
    Tang, Hongyin
    Jin, Beihong
    Zong, Chengqing
    2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019), 2019, : 1390 - 1395
  • [37] A Dirichlet process biterm-based mixture model for short text stream clustering
    Chen, Junyang
    Gong, Zhiguo
    Liu, Weiwen
    APPLIED INTELLIGENCE, 2020, 50 (05) : 1609 - 1619
  • [39] Semantic Evaluation of Text Clustering
    Sinh Hoa Nguyen
    Swieboda, Wojciech
    Hung Son Nguyen
    ADVANCED COMPUTATIONAL METHODS FOR KNOWLEDGE ENGINEERING, 2014, 282 : 269 - 280
  • [40] Variational Bayes estimation of hierarchical Dirichlet-multinomial mixtures for text clustering
    Bilancia, Massimo
    Di Nanni, Michele
    Manca, Fabio
    Pio, Gianvito
    COMPUTATIONAL STATISTICS, 2023, 38 (04) : 2015 - 2051