Asynchronous distributed estimation of topic models for document analysis

Cited by: 8
Authors
Asuncion, Arthur U. [1 ]
Smyth, Padhraic [1 ]
Welling, Max [1 ]
Affiliations
[1] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92717 USA
Keywords
Topic model; Distributed learning; Parallelization; Gibbs sampling
DOI
10.1016/j.stamet.2010.03.002
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
Given the prevalence of large data sets and the availability of inexpensive parallel computing hardware, there is significant motivation to explore distributed implementations of statistical learning algorithms. In this paper, we present a distributed learning framework for Latent Dirichlet Allocation (LDA), a well-known Bayesian latent variable model for sparse matrices of count data. In the proposed approach, data are distributed across P processors, and processors independently perform inference on their local data and communicate their sufficient statistics in a local asynchronous manner with other processors. We apply two different approximate inference techniques for LDA, collapsed Gibbs sampling and collapsed variational inference, within a distributed framework. The results show significant improvements in computation time and memory when running the algorithms on very large text corpora using parallel hardware. Despite the approximate nature of the proposed approach, simulations suggest that asynchronous distributed algorithms are able to learn models that are nearly as accurate as those learned by the standard non-distributed approaches. We also find that our distributed algorithms converge rapidly to good solutions. (C) 2010 Elsevier B.V. All rights reserved.
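The scheme the abstract describes, P workers independently running collapsed Gibbs sampling on their document shards and asynchronously gossiping word-topic count sufficient statistics to peers, can be sketched as a toy round-based simulation. This is an illustrative reconstruction, not the authors' implementation: the function name `async_lda`, the push-to-one-random-peer gossip rule, and the tiny corpus are all assumptions made for the sketch.

```python
import numpy as np

def async_lda(docs, V, K=2, P=2, alpha=0.5, beta=0.1, sweeps=30, seed=0):
    """Toy round-based simulation of asynchronous distributed collapsed
    Gibbs sampling for LDA (a sketch of the paper's idea, not the
    authors' code). Each worker samples topics on its own document
    shard using exact local word-topic counts plus stale cached counts
    received from peers, then gossips its counts to one random peer."""
    rng = np.random.default_rng(seed)
    shards = [list(range(p, len(docs), P)) for p in range(P)]  # doc d lives on worker d % P

    z = [rng.integers(K, size=len(doc)) for doc in docs]       # topic of each token
    ndk = np.zeros((len(docs), K), int)                        # doc-topic counts
    local_nw = np.zeros((P, K, V), int)                        # each worker's own word-topic counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1
            local_nw[d % P, z[d][i], w] += 1

    cache = np.zeros((P, P, K, V), int)                        # cache[p, q]: p's stale view of q's counts

    for _ in range(sweeps):
        for p in range(P):                                     # workers run "concurrently"; simulated in turn
            stale = cache[p].sum(axis=0)                       # peers' counts as last gossiped (cache[p, p] stays 0)
            for d in shards[p]:
                for i, w in enumerate(docs[d]):
                    k = z[d][i]
                    ndk[d, k] -= 1                             # remove token from counts
                    local_nw[p, k, w] -= 1
                    nw = local_nw[p] + stale
                    # collapsed conditional: p(z=k) is proportional to
                    # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                    probs = (ndk[d] + alpha) * (nw[:, w] + beta) / (nw.sum(axis=1) + V * beta)
                    k = rng.choice(K, p=probs / probs.sum())
                    z[d][i] = k
                    ndk[d, k] += 1                             # add token back under its new topic
                    local_nw[p, k, w] += 1
            q = (p + 1 + rng.integers(P - 1)) % P              # gossip: push own counts to a random peer
            cache[q, p] = local_nw[p]                          # peer stores a snapshot (numpy copies on assign)
    return local_nw.sum(axis=0)                                # merged word-topic counts across workers

# tiny synthetic corpus: two word groups over a 4-word vocabulary
docs = [[0, 0, 1, 1]] * 4 + [[2, 2, 3, 3]] * 4
nw = async_lda(docs, V=4, K=2, P=2)
assert nw.sum() == sum(len(doc) for doc in docs)               # every token assigned exactly once
```

On real hardware the workers would run in parallel processes and gossip over the network; the sequential loop here only mimics that interleaving, and the staleness of `cache` captures why the method is approximate.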
Pages: 3-17 (15 pages)
Related papers (50 total)
  • [21] Integrating social annotations into topic models for personalized document retrieval
    Xu, Bo
    Lin, Hongfei
    Lin, Yuan
    Guan, Yizhou
    SOFT COMPUTING, 2020, 24 (03) : 1707 - 1716
  • [22] Asynchronous Distributed Nonlinear Estimation Over Directed Networks
    Wang, Qianyao
    Yu, Rui
    Meng, Min
    IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, 2024, 11 (02): : 2062 - 2073
  • [25] Statistical topic models for multi-label document classification
    Rubin, Timothy N.
    Chambers, America
    Smyth, Padhraic
    Steyvers, Mark
    MACHINE LEARNING, 2012, 88 (1-2) : 157 - 208
  • [26] Expert-Informed Topic Models for Document Set Discovery
    Rinke, Eike Mark
    Dobbrick, Timo
    Loeb, Charlotte
    Zirn, Cäcilia
    Wessler, Hartmut
    COMMUNICATION METHODS AND MEASURES, 2022, 16 (01) : 39 - 58
  • [27] Distributed Sequential Estimation in Asynchronous Wireless Sensor Networks
    Hlinka, Ondrej
    Hlawatsch, Franz
    Djuric, Petar M.
    IEEE SIGNAL PROCESSING LETTERS, 2015, 22 (11) : 1965 - 1969
  • [28] Robust Unsupervised Segmentation of Degraded Document Images with Topic Models
    Burns, Timothy J.
    Corso, Jason J.
    CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, 2009, : 1287 - 1294
  • [29] A method of refining topic models based on term and document frequencies
    Higashi, K.
    Takahashi, H.
    Nakagawa, H.
    Tsuchiya, T.
    COMPUTER SOFTWARE, 2019, 36 (04) : 25 - 31
  • [30] Table Topic Models for Hidden Unit Estimation
    Yoshida, Minoru
    Matsumoto, Kazuyuki
    Kita, Kenji
    INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2016, 2016, 9994 : 302 - 307