Stochastic Variational Optimization of a Hierarchical Dirichlet Process Latent Beta-Liouville Topic Model

Cited by: 1
Authors
Ihou, Koffi Eddy [1 ]
Amayri, Manar [2 ]
Bouguila, Nizar [1 ]
Affiliations
[1] Concordia Univ, Montreal, PQ H3G 1M8, Canada
[2] Grenoble Inst Technol, F-38031 Grenoble, France
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Hierarchical Dirichlet process; Bayesian nonparametric topic model; Beta-Liouville distribution; stochastic and variational optimizations; predictive distributions; Poisson-Dirichlet; mixture models; distributions
DOI
10.1145/3502727
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In topic models, a collection is organized into documents, each of which arises as a mixture over latent clusters called topics, where a topic is a distribution over the vocabulary. In large-scale applications, parametric or finite topic mixture models such as LDA (latent Dirichlet allocation) and its variants are restricted in performance by their reduced hypothesis space. In this article, we address the problems of model selection and of sharing topics across multiple documents that limit standard parametric topic models. As an alternative, we propose a BNP (Bayesian nonparametric) topic model in which an HDP (hierarchical Dirichlet process) prior models document topic mixtures through their multinomials on the infinite simplex. We further propose an asymmetric BL (Beta-Liouville) distribution as a diffuse base measure for the corpus-level DP (Dirichlet process) over a measurable space; this choice reflects the highly heterogeneous structure of the set of all topics that describes the corpus probability measure. For consistency in posterior inference and predictive distributions, we efficiently characterize random probability measures whose limits are the global and local DPs, approximating the HDP through its stick-breaking formulation with GEM (Griffiths-Engen-McCloskey) random variables. Because the diffuse BL base measure is conjugate to the count-data distribution, we obtain an improved version of the standard HDP, which is usually based on a symmetric Dirichlet (Dir). In addition, to improve on the coordinate-ascent framework while taking advantage of its deterministic nature, our model implements an online optimization method based on document-level stochastic variational inference with natural gradients, accommodating fast topic learning when processing large collections of text documents. The high per-document predictive likelihood obtained in comparison with its competitors is consistent with the robustness of our fully asymmetric BL-based HDP. While assessing the predictive accuracy of the model through the probability of held-out documents, we also add a combination of metrics, topic coherence and topic diversity, to evaluate the quality and interpretability of the discovered topics, and we compare the performance of our model on these metrics against the standard symmetric LDA. We show that the performance of the online HDP-LBLA (latent BL allocation) is the asymptote for parametric topic models. The accuracy of the results (improved predictive distributions of the held-out documents) is a product of the model's ability to efficiently characterize dependencies between documents (topic correlation), as documents can now easily share topics, resulting in a much more robust and realistic compression algorithm for information modeling.
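
The abstract repeatedly contrasts the asymmetric BL base measure with the symmetric Dirichlet but does not state the density. For reference, a commonly used parameterization of the Beta-Liouville distribution (following Bouguila's earlier work; the symbol names are this note's own, not necessarily the paper's) is

\[
\mathrm{BL}(\mathbf{x} \mid \alpha_1,\dots,\alpha_D,\alpha,\beta)
= \frac{\Gamma\!\big(\sum_{d=1}^{D}\alpha_d\big)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}
\Big(\sum_{d=1}^{D} x_d\Big)^{\alpha-\sum_{d=1}^{D}\alpha_d}
\Big(1-\sum_{d=1}^{D} x_d\Big)^{\beta-1}
\prod_{d=1}^{D}\frac{x_d^{\,\alpha_d-1}}{\Gamma(\alpha_d)},
\]

for x_d > 0 with \(\sum_d x_d < 1\). Distinct \(\alpha_d\) give the asymmetry the abstract exploits, and the choice \(\alpha = \sum_d \alpha_d\) recovers a Dirichlet, which is why the BL can serve as a conjugate, drop-in generalization of the usual symmetric Dir base measure.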
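
The stick-breaking characterization with GEM variables that the abstract uses to approximate the HDP can be made concrete with a short, truncated simulation. The Python sketch below (the names, truncation levels, and hyperparameter values are this note's assumptions, not the authors' code) draws a corpus-level DP over topic atoms sampled from a BL base measure, then re-weights those shared atoms at the document level:

    # Minimal truncated two-level stick-breaking simulation of an HDP with a
    # Beta-Liouville base measure (illustrative only; not the authors' code).
    import numpy as np

    rng = np.random.default_rng(0)

    def gem(concentration, truncation, rng):
        # Truncated GEM weights: v_k ~ Beta(1, concentration); stick closed at the end.
        v = rng.beta(1.0, concentration, size=truncation)
        v[-1] = 1.0
        remaining = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
        return v * remaining  # nonnegative, sums to 1

    def bl_sample(alphas, a, b, rng):
        # Constructive Liouville draw: a Beta(a, b) radius scales a Dirichlet direction.
        u = rng.beta(a, b)
        y = rng.dirichlet(alphas)
        return np.concatenate((u * y, [1.0 - u]))  # a point on the full simplex

    V, K, T = 1000, 50, 15        # vocabulary size; corpus and document truncations
    gamma0, alpha0 = 1.0, 1.0     # concentrations of the corpus- and document-level DPs

    # Corpus-level DP G0: global weights beta over topic atoms drawn from the BL base.
    beta = gem(gamma0, K, rng)
    topics = np.stack([bl_sample(0.1 * np.ones(V - 1), 2.0, 3.0, rng) for _ in range(K)])

    # Document-level DP G_j: new sticks, but atoms re-drawn from G0, so documents
    # share the same global topics and merely re-weight them.
    doc_sticks = gem(alpha0, T, rng)
    doc_atoms = rng.choice(K, size=T, p=beta)
    doc_topic_mix = np.bincount(doc_atoms, weights=doc_sticks, minlength=K)

Because every document selects its atoms from the same global measure G0, topics are shared across documents, which is precisely the dependency (topic correlation) the abstract credits for the improved predictive distributions.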
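
The abstract's online optimization follows the usual pattern of stochastic variational inference: sample a document, compute its local variational statistics, pretend the corpus consists of that document repeated, and blend the resulting estimate into the global parameters along the natural gradient. A minimal sketch of that outer loop, with the per-document E-step left abstract (lam, eta, tau0, kappa, and local_step are illustrative names of this note, not the paper's):

    # Minimal sketch of document-level stochastic variational inference on a
    # global variational parameter `lam` (e.g., per-topic word pseudo-counts).
    import numpy as np

    def svi(corpus, lam, eta, local_step, tau0=64.0, kappa=0.7, epochs=1, seed=0):
        rng = np.random.default_rng(seed)
        D, t = len(corpus), 0
        for _ in range(epochs):
            for d in rng.permutation(D):
                rho = (t + tau0) ** (-kappa)        # Robbins-Monro step size, kappa in (0.5, 1]
                stats = local_step(corpus[d], lam)  # per-document E-step (left abstract here)
                lam_hat = eta + D * stats           # estimate as if the corpus were D copies of d
                lam = (1.0 - rho) * lam + rho * lam_hat  # step along the natural gradient
                t += 1
        return lam

The decaying step size keeps the updates a valid stochastic approximation, while processing one document at a time is what lets the model scale to large collections.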
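
Finally, the two interpretability metrics named in the abstract are easy to state concretely. The sketch below gives one common definition of each: topic diversity as the fraction of unique words among the top-k words of all topics, and coherence as an average normalized PMI over top-word pairs. The paper may use different variants, so treat this as an assumption-laden reference implementation:

    # Illustrative definitions of topic diversity and NPMI coherence.
    import numpy as np
    from itertools import combinations

    def topic_diversity(topics, k=25):
        # Fraction of unique words among the top-k words of every topic (1.0 = no overlap).
        top = np.argsort(topics, axis=1)[:, -k:]
        return len(np.unique(top)) / (topics.shape[0] * k)

    def npmi_coherence(topics, docs_as_sets, k=10, eps=1e-12):
        # Average normalized PMI over top-word pairs; co-occurrence counted per document.
        D, scores = len(docs_as_sets), []
        for topic in topics:
            top = np.argsort(topic)[-k:]
            pair = []
            for wi, wj in combinations(top, 2):
                pi = sum(wi in doc for doc in docs_as_sets) / D
                pj = sum(wj in doc for doc in docs_as_sets) / D
                pij = sum((wi in doc) and (wj in doc) for doc in docs_as_sets) / D
                pair.append(np.log((pij + eps) / (pi * pj + eps)) / -np.log(pij + eps))
            scores.append(np.mean(pair))
        return float(np.mean(scores))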
Pages: 48
Related Papers
50 items in total
  • [21] Yazdi, Sahar Salmanzade; Najar, Fatma; Bouguila, Nizar. Bayesian Folding-In Using Generalized Dirichlet and Beta-Liouville Kernels for Information Retrieval. 2022 IEEE Symposium Series on Computational Intelligence (SSCI), 2022: 1430-1435.
  • [22] Yang, Ming; Hsu, William H. HDPauthor: A New Hybrid Author-Topic Model using Latent Dirichlet Allocation and Hierarchical Dirichlet Processes. Proceedings of the 25th International Conference on World Wide Web (WWW'16 Companion), 2016: 619-624.
  • [23] Casgrain, Philippe. A Latent Variational Framework for Stochastic Optimization. Advances in Neural Information Processing Systems 32 (NIPS 2019), 2019.
  • [24] Ali, Samr; Bouguila, Nizar. Multimodal action recognition using variational-based Beta-Liouville hidden Markov models. IET Image Processing, 2020, 14(17): 4785-4794.
  • [25] Han, Zhong-Ming; Zhang, Meng-Mei; Li, Meng-Qi; Duan, Da-Gao; Chen, Yi. Flow Hierarchical Dirichlet Process for Complex Topic Modeling. Jisuanji Xuebao/Chinese Journal of Computers, 2019, 42(7): 1539-1552.
  • [26] Burkhardt, Sophie; Kramer, Stefan. Online Sparse Collapsed Hybrid Variational-Gibbs Algorithm for Hierarchical Dirichlet Process Topic Models. Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2017, Part II, 2017, 10535: 189-204.
  • [27] Masada, Tomonari; Takasu, Atsuhiro. A Simple Stochastic Gradient Variational Bayes for Latent Dirichlet Allocation. Computational Science and Its Applications - ICCSA 2016, Part IV, 2016, 9789: 232-245.
  • [28] Foulds, James; Boyles, Levi; DuBois, Christopher; Smyth, Padhraic; Welling, Max. Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation. 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'13), 2013: 446-454.
  • [29] Bala, Ibrahim Bakari; Saringat, Mohd Zainuri. GPLDA: A Generalized Poisson Latent Dirichlet Topic Model. International Journal of Advanced Computer Science and Applications, 2019, 10(12): 403-407.
  • [30] Ali, Samr; Bouguila, Nizar. Maximum A Posteriori Approximation of Dirichlet and Beta-Liouville Hidden Markov Models for Proportional Sequential Data Modeling. 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2020: 4081-4087.