A Nested Chinese Restaurant Topic Model for Short Texts with Document Embeddings

Cited by: 4
Authors
Niu, Yue [1]
Zhang, Hongjie [1]
Li, Jing [1]
Affiliations
[1] Univ Sci & Technol China, Dept Comp Sci & Technol, Hefei 230052, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 18
Keywords
topic model; text mining; document embeddings; short text
DOI
10.3390/app11188708
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
In recent years, short texts have become a prevalent form of text on the internet. Because each text is so short, conventional topic models suffer from sparse word co-occurrence information when applied to them. Researchers have proposed various topic models customized for short texts that supply additional word co-occurrence information. However, these models cannot incorporate sufficient semantic word co-occurrence information and may introduce additional noise. To address these issues, we propose a self-aggregated topic model that incorporates document embeddings. Aggregating short texts into long documents according to their document embeddings provides sufficient word co-occurrence information and avoids incorporating non-semantic word co-occurrence information. However, the document embeddings of short texts themselves contain considerable noise, which results from the same sparsity of word co-occurrence information. We therefore discard the noise by converting the document embeddings into global and local semantic information. The global semantic information is the similarity probability distribution over the entire dataset, and the local semantic information is the distances between similar short texts. We then adopt a nested Chinese restaurant process to incorporate these two kinds of information. Finally, we compare our model with several state-of-the-art models on four real-world short-text corpora. The experimental results show that our model achieves better performance in terms of topic coherence and classification accuracy.
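The abstract names two ingredients: a similarity probability distribution derived from document embeddings, and a nested Chinese restaurant process (nCRP). The sketch below illustrates both in generic form and is not the authors' model. The softmax over cosine similarities, the `temperature` and `alpha` parameters, and all function names are assumptions introduced here for illustration; the abstract does not specify how the distribution is computed or how it enters the nCRP prior.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_distribution(embeddings, i, temperature=1.0):
    """Assumed form: softmax over cosine similarities from document i
    to every other document; document i itself gets probability 0."""
    scores = [
        math.exp(cosine(embeddings[i], e) / temperature) if j != i else 0.0
        for j, e in enumerate(embeddings)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def crp_choose(counts, alpha, rng):
    """Standard CRP step: pick existing table k with weight counts[k],
    or a new table (index len(counts)) with weight alpha."""
    weights = counts + [alpha]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def sample_ncrp_paths(n_docs, depth, alpha, seed=0):
    """Generic nested CRP: each document walks from the root and runs
    one CRP step at every level, so each tree node has its own CRP."""
    rng = random.Random(seed)
    child_counts = {}  # path prefix (tuple) -> customer counts per child
    paths = []
    for _ in range(n_docs):
        path = ()
        for _ in range(depth):
            counts = child_counts.setdefault(path, [])
            k = crp_choose(counts, alpha, rng)
            if k == len(counts):
                counts.append(1)   # open a new child node
            else:
                counts[k] += 1     # join an existing child node
            path = path + (k,)
        paths.append(path)
    return paths


# Toy data: three hypothetical 2-d document embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_probs = similarity_distribution(toy_embeddings, 0)
toy_paths = sample_ncrp_paths(n_docs=20, depth=2, alpha=1.0, seed=0)
```

In this toy setup, `toy_probs` puts more mass on the second document than on the third because its embedding points in nearly the same direction as the first. `toy_paths` assigns each of the 20 documents a two-level path, and documents sharing a path prefix would be aggregated together in a self-aggregation scheme of this kind.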
Pages: 19