Latent IBP Compound Dirichlet Allocation

Cited by: 13
Authors:
Archambeau, Cedric [1 ]
Lakshminarayanan, Balaji [3 ]
Bouchard, Guillaume [2 ]
Affiliations:
[1] Amazon Berlin, Berlin, Germany
[2] Xerox Res Ctr Europe, Meylan, France
[3] UCL, CSML, Gatsby Computat Neurosci Unit, London, England
Keywords:
Bayesian nonparametrics; power-law distribution; sparse modelling; topic modelling; clustering; bag-of-words representation; Gibbs sampling;
DOI:
10.1109/TPAMI.2014.2313122
Chinese Library Classification:
TP18 [Artificial Intelligence Theory];
Subject Classification Codes:
081104 ; 0812 ; 0835 ; 1405 ;
Abstract:
We introduce the four-parameter IBP compound Dirichlet process (ICDP), a stochastic process that generates sparse non-negative vectors with a potentially unbounded number of entries. If we repeatedly sample from the ICDP we can generate sparse matrices with an infinite number of columns and power-law characteristics. We apply the four-parameter ICDP to sparse nonparametric topic modelling to account for the very large number of topics present in large text corpora and the power-law distribution of the vocabulary of natural languages. The model, which we call latent IBP compound Dirichlet allocation (LIDA), allows for power-law distributions both in the number of topics summarising the documents and in the number of words defining each topic. It can be interpreted as a sparse variant of the hierarchical Pitman-Yor process when applied to topic modelling. We derive an efficient and simple collapsed Gibbs sampler closely related to the collapsed Gibbs sampler of latent Dirichlet allocation (LDA), making the model applicable in a wide range of domains. Our nonparametric Bayesian topic model compares favourably to the widely used hierarchical Dirichlet process and its heavy-tailed version, the hierarchical Pitman-Yor process, on benchmark corpora. Experiments demonstrate that accounting for the power-law distribution of real data is beneficial and that sparsity provides more interpretable results.
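To make the abstract's generative idea concrete, the following is a minimal toy sketch of the general IBP-compound-Dirichlet construction: a binary mask drawn from an Indian buffet process is compounded with Dirichlet weights restricted to each row's active columns, yielding sparse non-negative vectors with an unbounded number of entries. This is an illustrative assumption, not the paper's exact four-parameter ICDP: it uses only the one-parameter IBP (so it does not reproduce the power-law behaviour the extra parameters provide), and all function names and the parameters alpha and gamma are chosen for this example.

```python
import numpy as np

def sample_ibp(num_rows, alpha, rng):
    # One-parameter Indian buffet process: each row activates a sparse,
    # potentially unbounded set of columns. The paper's ICDP uses a
    # four-parameter IBP to obtain power-law column statistics; this toy
    # keeps only alpha for brevity.
    Z = np.zeros((num_rows, 0), dtype=int)
    for i in range(num_rows):
        if Z.shape[1] > 0:
            # Reuse existing column k with probability m_k / (i + 1),
            # where m_k is the number of earlier rows using column k.
            probs = Z[:i].sum(axis=0) / (i + 1)
            Z[i] = rng.random(Z.shape[1]) < probs
        # Open a Poisson(alpha / (i + 1)) number of brand-new columns.
        k_new = rng.poisson(alpha / (i + 1))
        if k_new > 0:
            new_cols = np.zeros((num_rows, k_new), dtype=int)
            new_cols[i] = 1
            Z = np.hstack([Z, new_cols])
    return Z

def compound_with_dirichlet(Z, gamma, rng):
    # Place Dirichlet(gamma) weights on each row's active columns only,
    # producing sparse non-negative vectors (zero mass elsewhere).
    theta = np.zeros(Z.shape)
    for i, z in enumerate(Z):
        active = np.flatnonzero(z)
        if active.size > 0:
            theta[i, active] = rng.dirichlet(gamma * np.ones(active.size))
    return theta

rng = np.random.default_rng(0)
Z = sample_ibp(num_rows=20, alpha=3.0, rng=rng)         # document-by-topic mask
theta = compound_with_dirichlet(Z, gamma=0.5, rng=rng)  # sparse topic proportions
print(Z.shape)            # number of topics is not fixed in advance
print(theta.sum(axis=1))  # 1.0 for documents with at least one active topic
```

In LIDA this kind of sparse compounding is applied both to document-topic and topic-word distributions, and inference is carried out with a collapsed Gibbs sampler rather than by direct forward sampling as above.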
Pages: 321-333
Page count: 13