Inference and evaluation of the multinomial mixture model for text clustering

Cited by: 65
Authors
Rigouste, Lois
Cappe, Olivier
Yvon, Francois
Affiliations
[1] GET Telecom Paris, F-75634 Paris 13, France
[2] CNRS, LTCI, F-75634 Paris 13, France
Keywords
multinomial mixture model; expectation-maximization; Gibbs sampling; text clustering
DOI
10.1016/j.ipm.2006.11.001
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. Probabilistic clustering models, which build "soft" theme-document associations, have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between documents and clusters. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space. The model considered in this paper consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We propose a systematic evaluation framework to contrast various estimation procedures for this model. Starting with the expectation-maximization (EM) algorithm as the basic tool for inference, we discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We empirically show that, in the case of text processing, these difficulties can be alleviated by introducing the vocabulary incrementally, due to the specific profile of the word count distributions. Using the fact that the model parameters can be analytically integrated out, we finally show that Gibbs sampling on the theme configurations is tractable and compares favorably to the basic EM approach. © 2006 Elsevier Ltd. All rights reserved.
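As a rough illustration of the basic EM procedure the abstract refers to, the following is a minimal sketch of EM for a mixture of multinomials over word counts. It is not the authors' implementation: the function names (`em_multinomial_mixture`, `log_sum_exp`), the additive smoothing constant, the optional `init_resp` argument for supplying an initialization, and the toy count vectors are all assumptions made for this example.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def em_multinomial_mixture(counts, n_themes, n_iter=50, smoothing=0.1,
                           init_resp=None, seed=0):
    """Fit a multinomial mixture by EM (illustrative sketch, n_iter >= 1).

    counts    : list of documents, each a list of word counts of length V
    init_resp : optional initial soft assignments (one row per document);
                a random soft assignment is used when omitted
    Returns (alpha, beta, resp): theme weights, per-theme word
    probabilities, and per-document theme posteriors.
    """
    n_docs, vocab = len(counts), len(counts[0])
    if init_resp is None:
        rng = random.Random(seed)
        init_resp = [[rng.random() + 1e-3 for _ in range(n_themes)]
                     for _ in range(n_docs)]
    # Normalize each row so it is a probability vector over themes.
    resp = [[r / sum(row) for r in row] for row in init_resp]

    for _ in range(n_iter):
        # M-step: smoothed theme weights and word probabilities.
        alpha = [(smoothing + sum(resp[d][k] for d in range(n_docs)))
                 / (n_docs + n_themes * smoothing)
                 for k in range(n_themes)]
        beta = []
        for k in range(n_themes):
            mass = [smoothing
                    + sum(resp[d][k] * counts[d][w] for d in range(n_docs))
                    for w in range(vocab)]
            total = sum(mass)
            beta.append([m / total for m in mass])

        # E-step: theme posteriors per document, computed in log space.
        # The multinomial coefficient is the same for every theme, so it
        # cancels in the normalization and is omitted.
        for d in range(n_docs):
            log_post = [math.log(alpha[k])
                        + sum(c * math.log(beta[k][w])
                              for w, c in enumerate(counts[d]))
                        for k in range(n_themes)]
            norm = log_sum_exp(log_post)
            resp[d] = [math.exp(lp - norm) for lp in log_post]
    return alpha, beta, resp


# Toy usage: four tiny "documents" over a four-word vocabulary.
docs = [[5, 4, 0, 0], [6, 3, 0, 0], [0, 0, 4, 5], [0, 1, 5, 4]]
alpha, beta, resp = em_multinomial_mixture(docs, n_themes=2)
for row in resp:
    print([round(p, 3) for p in row])
```

Each row of `resp` is the kind of per-document probability vector the abstract describes. Passing different `init_resp` values or `smoothing` constants is one simple way to observe the sensitivity to initialization and smoothing that the abstract mentions.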
Pages: 1260-1280
Number of pages: 21