Asynchronous distributed estimation of topic models for document analysis

Cited by: 8
Authors
Asuncion, Arthur U. [1 ]
Smyth, Padhraic [1 ]
Welling, Max [1 ]
Affiliation
[1] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92717 USA
Keywords
Topic model; Distributed learning; Parallelization; Gibbs sampling;
DOI
10.1016/j.stamet.2010.03.002
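Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract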
Given the prevalence of large data sets and the availability of inexpensive parallel computing hardware, there is significant motivation to explore distributed implementations of statistical learning algorithms. In this paper, we present a distributed learning framework for Latent Dirichlet Allocation (LDA), a well-known Bayesian latent variable model for sparse matrices of count data. In the proposed approach, data are distributed across P processors, and processors independently perform inference on their local data and communicate their sufficient statistics in a local asynchronous manner with other processors. We apply two different approximate inference techniques for LDA, collapsed Gibbs sampling and collapsed variational inference, within a distributed framework. The results show significant improvements in computation time and memory when running the algorithms on very large text corpora using parallel hardware. Despite the approximate nature of the proposed approach, simulations suggest that asynchronous distributed algorithms are able to learn models that are nearly as accurate as those learned by the standard non-distributed approaches. We also find that our distributed algorithms converge rapidly to good solutions. (C) 2010 Elsevier B.V. All rights reserved.
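The scheme outlined in the abstract can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it covers only the collapsed Gibbs sampling variant, the "processors" are plain dictionaries visited in turn rather than concurrent workers, and the pairwise exchange rule (each processor caches the last word-topic counts received from a peer and replaces them at the next meeting) is a simplification chosen for illustration. The names `init_processor`, `gibbs_sweep`, `exchange`, and `run_async_lda`, and the hyperparameter defaults, are hypothetical.

```python
import random

def init_processor(docs, K, V, rng):
    """Assign a random topic to every local token and build local counts."""
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    n_dk = [[0] * K for _ in docs]           # document-topic counts
    local = [[0] * K for _ in range(V)]      # word-topic counts from local data
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            local[w][k] += 1
    others = [[0] * K for _ in range(V)]     # counts last received from peers
    return {"docs": docs, "z": z, "n_dk": n_dk,
            "local": local, "others": others, "cache": {}}

def gibbs_sweep(proc, K, V, alpha, beta, rng):
    """One collapsed Gibbs sweep over local tokens, using possibly stale
    peer counts as an approximation to the global word-topic statistics."""
    local, others, n_dk = proc["local"], proc["others"], proc["n_dk"]
    n_k = [sum(local[w][k] + others[w][k] for w in range(V)) for k in range(K)]
    for d, doc in enumerate(proc["docs"]):
        for i, w in enumerate(doc):
            k_old = proc["z"][d][i]
            n_dk[d][k_old] -= 1
            local[w][k_old] -= 1
            n_k[k_old] -= 1
            # Standard collapsed Gibbs conditional for LDA.
            weights = [(n_dk[d][k] + alpha)
                       * (local[w][k] + others[w][k] + beta)
                       / (n_k[k] + V * beta) for k in range(K)]
            k_new = rng.choices(range(K), weights=weights)[0]
            proc["z"][d][i] = k_new
            n_dk[d][k_new] += 1
            local[w][k_new] += 1
            n_k[k_new] += 1

def exchange(procs, p, q):
    """Processors p and q swap local counts; each replaces the stale copy
    it cached from the other (assumed rule, for illustration only)."""
    for a, b in ((p, q), (q, p)):
        fresh = [row[:] for row in procs[b]["local"]]
        stale = procs[a]["cache"].get(b)
        others = procs[a]["others"]
        for w, row in enumerate(fresh):
            for k, count in enumerate(row):
                others[w][k] += count - (stale[w][k] if stale else 0)
        procs[a]["cache"][b] = fresh

def run_async_lda(shards, K, V, sweeps=20, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    procs = [init_processor(s, K, V, rng) for s in shards]
    for _ in range(sweeps):
        for proc in procs:                   # concurrent in a real system
            gibbs_sweep(proc, K, V, alpha, beta, rng)
        p, q = rng.sample(range(len(procs)), 2)   # one random pairwise meeting
        exchange(procs, p, q)
    return procs

# Toy corpus: word ids in [0, 6), split into two shards of two documents each.
shards = [[[0, 1, 2, 0], [1, 2, 0]], [[3, 4, 5, 3], [4, 5, 5, 3]]]
procs = run_async_lda(shards, K=2, V=6)
```

Each processor only ever resamples its own tokens, so its local counts always sum to the size of its shard; what is approximate is the view it holds of everyone else's counts between exchanges.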
Pages: 3-17
Number of pages: 15