Scalable Training of Hierarchical Topic Models

Cited by: 10
Authors
Chen, Jianfei [1 ]
Zhu, Jun [1 ]
Lu, Jie [2 ]
Liu, Shixia [2 ]
Affiliations
[1] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Dept Comp Sci & Tech, Beijing 100084, Peoples R China
[2] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Sch Software, Beijing 100084, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2018, Vol. 11, No. 7
Funding
Beijing Natural Science Foundation
Keywords
DIRICHLET; INFERENCE;
DOI
10.14778/3192965.3192972
CLC classification
TP [Automation technology, computer technology]
Subject classification
0812
Abstract
Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics at different levels of abstraction, which leads to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures, such as trees and concurrent, dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, as well as an initialization strategy to deal with the local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTM training as well as distributed data structures, including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation of hLDA, and can scale to thousands of CPU cores. We demonstrate our scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation can extract 1,722 topics from the corpus with 50 machines in just 7 hours.
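The hLDA model described in the abstract places a nested Chinese restaurant process (nCRP) prior over a tree of topics: each document draws a root-to-leaf path, and at each level it either follows an existing child with probability proportional to how many documents already pass through it, or creates a new child with probability proportional to a hyperparameter gamma. The following is a minimal illustrative sketch of this path-sampling step only, with names and structure chosen here for clarity; it is not the paper's vectorized or distributed implementation.

```python
import random


class Node:
    """A node in the nCRP tree; tracks how many documents pass through it."""

    def __init__(self):
        self.count = 0
        self.children = []


def ncrp_sample_path(root, depth, gamma, rng=random):
    """Draw one root-to-leaf path of the given depth from the nCRP prior.

    At each level, an existing child c is chosen with probability
    proportional to c.count, and a brand-new child with probability
    proportional to gamma (growing the tree on demand).
    """
    path = [root]
    root.count += 1
    node = root
    for _ in range(depth - 1):
        total = sum(c.count for c in node.children) + gamma
        r = rng.uniform(0, total)
        chosen = None
        for c in node.children:
            r -= c.count
            if r <= 0:
                chosen = c
                break
        if chosen is None:  # landed in the gamma mass: open a new table
            chosen = Node()
            node.children.append(chosen)
        chosen.count += 1
        path.append(chosen)
        node = chosen
    return path
```

In a full Gibbs sampler this prior probability would be multiplied by the likelihood of the document's words under each candidate path before sampling, and a document's old path counts would be decremented before resampling; the sketch above shows only the tree-growing prior draw.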
Pages: 826-839
Page count: 14
Related papers (50 in total)
  • [1] Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models
    Terenin, Alexander
    Magnusson, Mans
    Jonsson, Leif
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2925 - 2934
  • [2] Scalable Generalized Dynamic Topic Models
    Jaehnichen, Patrick
    Wenzel, Florian
    Kloft, Marius
    Mandt, Stephan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018, 84
  • [3] Neural Topic Models for Hierarchical Topic Detection and Visualization
    Pham, Dang
    Le, Tuan M. V.
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT III, 2021, 12977 : 35 - 51
  • [4] Ranking Answers by Hierarchical Topic Models
    Qin, Zengchang
    Thint, Marcus
    Huang, Zhiheng
    NEXT-GENERATION APPLIED INTELLIGENCE, PROCEEDINGS, 2009, 5579 : 103 - +
  • [5] Hierarchical Topic Models for Expanding Category Hierarchies
    Yamamoto, Kohei
    Eguchi, Koji
    Takasu, Atsuhiro
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 242 - 249
  • [6] Latent tree models for hierarchical topic detection
    Chen, Peixian
    Zhang, Nevin L.
    Liu, Tengfei
    Poon, Leonard K. M.
    Chen, Zhourong
    Khawar, Farhan
    ARTIFICIAL INTELLIGENCE, 2017, 250 : 105 - 124
  • [7] Scalable Factorized Hierarchical Variational Autoencoder Training
    Hsu, Wei-Ning
    Glass, James
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1462 - 1466
  • [8] Scalable Inference in Max-margin Topic Models
    Zhu, Jun
    Zheng, Xun
    Zhou, Li
    Zhang, Bo
    19TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'13), 2013, : 964 - 972
  • [9] Scalable Rejection Sampling for Bayesian Hierarchical Models
    Braun, Michael
    Damien, Paul
    MARKETING SCIENCE, 2016, 35 (03) : 427 - 444
  • [10] Hierarchical topic models and the nested Chinese restaurant process
    Blei, DM
    Griffiths, TL
    Jordan, MI
    Tenenbaum, JB
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 16, 2004, 16 : 17 - 24