Scalable Training of Hierarchical Topic Models

Cited by: 10
Authors
Chen, Jianfei [1 ]
Zhu, Jun [1 ]
Lu, Jie [2 ]
Liu, Shixia [2 ]
Institutions
[1] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Dept Comp Sci & Tech, Beijing 100084, Peoples R China
[2] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Sch Software, Beijing 100084, Peoples R China
Source
Proceedings of the VLDB Endowment | 2018, Vol. 11, No. 7
Funding
Beijing Natural Science Foundation
Keywords
DIRICHLET; INFERENCE;
DOI
10.14778/3192965.3192972
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics at different levels of abstraction, which leads to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures such as trees and concurrently, dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, as well as an initialization strategy to deal with the local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTMs as well as distributed data structures, including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation of hLDA, and can scale to thousands of CPU cores. We demonstrate our scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation can extract 1,722 topics from this corpus with 50 machines in just 7 hours.
Pages: 826-839 (14 pages)
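
As a rough illustration of the tree-structured model the abstract refers to, the sketch below samples document paths from a nested Chinese Restaurant Process (nCRP), the prior that hLDA places over its dynamically growing topic tree. This is a minimal, illustrative sketch only, not the authors' partially collapsed Gibbs sampler or their distributed data structures; the `Node` class, `sample_path` function, the fixed depth, and the concentration parameter `gamma` are hypothetical names and choices made here for illustration.

```python
import random

class Node:
    """One topic node in the nCRP tree; children grow dynamically."""
    def __init__(self, depth):
        self.depth = depth
        self.count = 0          # documents whose path passes through this node
        self.children = []      # dynamically growing list of child topics

def sample_path(root, max_depth, gamma):
    """Sample a root-to-leaf path of length `max_depth` from the nCRP prior.

    At each level, an existing child is chosen with probability proportional
    to its count, and a brand-new child with probability proportional to
    gamma; the new-child case is where the tree grows.
    """
    path = [root]
    node = root
    node.count += 1
    for _ in range(1, max_depth):
        weights = [child.count for child in node.children] + [gamma]
        r = random.uniform(0, sum(weights))
        acc = 0.0
        chosen = None
        for child, w in zip(node.children, weights):
            acc += w
            if r <= acc:
                chosen = child
                break
        if chosen is None:                      # "new table": create a new child topic
            chosen = Node(node.depth + 1)
            node.children.append(chosen)
        chosen.count += 1
        path.append(chosen)
        node = chosen
    return path

if __name__ == "__main__":
    root = Node(depth=0)
    for _ in range(100):                        # 100 "documents" drawing 3-level paths
        sample_path(root, max_depth=3, gamma=1.0)
    print("branches under root:", len(root.children))
```

In the full hLDA sampler described in the paper, the choice at each level is additionally weighted by the likelihood of the document's words under the candidate topics, and the counts and tree are maintained in the distributed, concurrently growing data structures that the abstract mentions; the sketch above only shows the prior over paths.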