Scalable Training of Hierarchical Topic Models

Cited by: 10
Authors
Chen, Jianfei [1 ]
Zhu, Jun [1 ]
Lu, Jie [2 ]
Liu, Shixia [2 ]
Affiliations
[1] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Dept Comp Sci & Tech, Beijing 100084, Peoples R China
[2] Tsinghua Univ, BNRist Ctr, State Key Lab Intell Tech & Sys, Sch Software, Beijing 100084, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2018, Vol. 11, No. 7
Funding
Beijing Natural Science Foundation;
Keywords
DIRICHLET; INFERENCE;
DOI
10.14778/3192965.3192972
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) can learn topics at different levels of abstraction, which leads to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs because of their complicated data structures, such as trees and concurrently accessed, dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model, a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, together with an initialization strategy that addresses the local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose an efficient data layout for vectorizing HTM training as well as distributed data structures, including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation of hLDA and scales to thousands of CPU cores. We demonstrate this scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpora. Our distributed implementation extracts 1,722 topics from this corpus with 50 machines in just 7 hours.
Pages: 826-839
Number of pages: 14
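
The abstract centers on hLDA, whose nested Chinese Restaurant Process (nCRP) prior assigns each document a root-to-leaf path through a dynamically growing topic tree. Below is a minimal Python sketch of that prior draw only: the names (Node, sample_child, sample_path) are hypothetical, the collapsed word-likelihood term of the paper's partially collapsed Gibbs sampler is omitted, and the paper's actual system replaces such pointer-based trees with vectorized, distributed dynamic matrices.

import numpy as np

class Node:
    """One topic node in the nCRP tree (illustrative pointer-based layout)."""
    def __init__(self, level, gamma):
        self.level = level        # depth in the tree (root = 0)
        self.gamma = gamma        # nCRP concentration parameter
        self.children = []        # child topic nodes
        self.num_docs = 0         # documents whose path passes through this node

def sample_child(node, rng):
    """nCRP step: pick an existing child with probability proportional to its
    document count, or open a new child with probability proportional to gamma."""
    weights = [c.num_docs for c in node.children] + [node.gamma]
    probs = np.asarray(weights, dtype=float)
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    if k == len(node.children):   # "new table": grow the tree with a fresh topic
        child = Node(node.level + 1, node.gamma)
        node.children.append(child)
        return child
    return node.children[k]

def sample_path(root, depth, rng):
    """Draw a root-to-leaf path of the given depth for one document and
    update the document counts along it."""
    path = [root]
    for _ in range(depth - 1):
        path.append(sample_child(path[-1], rng))
    for node in path:
        node.num_docs += 1
    return path

# Example: draw nCRP paths for five documents in a depth-3 tree.
rng = np.random.default_rng(0)
root = Node(level=0, gamma=1.0)
paths = [sample_path(root, depth=3, rng=rng) for _ in range(5)]

A full sampler would additionally remove a document's old path counts before resampling and weight each candidate path by its collapsed word likelihood; handling those dynamically growing counts under concurrency is precisely the systems challenge the paper's data layout addresses.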