A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Cited by: 31
Authors
Yu, Hsiang-Fu [1]
Hsieh, Cho-Jui [1]
Yun, Hyokun [2]
Vishwanathan, S. V. N. [3]
Dhillon, Inderjit S. [1]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Amazon, Seattle, WA USA
[3] Univ Calif Santa Cruz, Santa Cruz, CA 95064 USA
Funding
National Science Foundation (USA)
Keywords
Topic Models; Scalability; Sampling
DOI
10.1145/2736277.2741682
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper, we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both of these problems. To handle a large number of topics, we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over T items in O(log T) time. Moreover, when topic counts change, the data structure can be updated in O(log T) time. To distribute the computation across multiple processors, we present a novel asynchronous framework inspired by the Nomad algorithm of [25]. We show that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems that involve millions of documents, billions of words, and thousands of topics.
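
The O(log T) sampling and update costs mentioned in the abstract can be illustrated with a textbook Fenwick (binary indexed) tree. The sketch below is only an illustration under stated assumptions: it uses the standard Fenwick layout rather than the authors' modified variant, which the abstract does not specify, and the class name `FenwickSampler` and its methods are hypothetical names chosen for this example.

```python
import random


class FenwickSampler:
    """Textbook Fenwick tree over T non-negative weights (illustrative only;
    not the modified structure used in the paper)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)  # 1-based internal array
        # O(T) build: add each node's partial sum into its parent node.
        for i, w in enumerate(weights, start=1):
            self.tree[i] += w
            parent = i + (i & -i)
            if parent <= self.n:
                self.tree[parent] += self.tree[i]
        # Largest power of two not exceeding n, used by the descent in find().
        self.top = 1 << (self.n.bit_length() - 1) if self.n else 0

    def add(self, index, delta):
        """Add delta to the weight at 0-based index in O(log T)."""
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        """Sum of all weights in O(log T)."""
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def find(self, u):
        """Smallest 0-based index whose inclusive prefix sum exceeds u,
        located by a binary descent in O(log T)."""
        pos, step = 0, self.top
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)  # guard against floating-point edge cases

    def sample(self, rng=random):
        """Draw an index with probability proportional to its weight."""
        return self.find(rng.random() * self.total())


# Usage: four topics with unnormalized weights, then a count change.
sampler = FenwickSampler([1.0, 2.0, 3.0, 4.0])
topic = sampler.sample()   # index in {0, 1, 2, 3}
sampler.add(1, -2.0)       # topic 1 now has weight 0 and is never drawn
```

Both `add` and `find` touch at most about log2(T) entries of the array, which is where the logarithmic cost comes from; a flat array of weights would instead need O(T) work per draw.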
Pages: 1340-1350
Page count: 11