Embedding Semantic Anchors to Guide Topic Models on Short Text Corpora

Cited by: 3
Authors
Steuber, Florian [1]
Schneider, Sinclair [1]
Schoenfeld, Mirco [2]
Affiliations
[1] Univ Bundeswehr Munchen, Res Inst CODE, Neubiberg, Germany
[2] Univ Bayreuth, Bayreuth, Germany
Keywords
Topic modeling; Short text; Word embedding; Transfer learning; Big data
DOI
10.1016/j.bdr.2021.100293
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Posts on the social media platform Twitter are written in a short, simple style rather than at length and in detail. Further, the core message of a post is often encoded in characteristic phrases called hashtags, which illustrate the semantics of a post or tie it to a specific topic. In this paper, we propose multiple approaches that use hashtags and their surrounding text to improve topic modeling of short texts. We use transfer learning by applying a pre-trained word embedding of hashtags to derive preliminary topics. These function as supervising information, or seed topics, and are passed to Archetypal LDA (A-LDA), a recent variant of Latent Dirichlet Allocation. We demonstrate the effectiveness of our approach on a large corpus of posts, using Twitter as the example platform. Our approaches improve the topic model's quality in terms of various quantitative metrics. Moreover, the presented algorithms for extracting seed topics can themselves be used as a form of lightweight topic model. Hence, our approaches create additional analytical opportunities and can help to gain a more detailed understanding of what people are talking about on social media. By using big data, in the form of millions of tweets, for preprocessing and fine-tuning, we enable the classification algorithm to produce topics that are very coherent to the reader. (C) 2021 Elsevier Inc. All rights reserved.
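The idea of deriving preliminary topics from a hashtag embedding can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how seed topics are extracted, so plain k-means over unit-normalised vectors is swapped in here as an assumption. The two-dimensional vectors in `toy_embedding`, the function name `cluster_hashtags`, and the choice of `k` are all hypothetical. A real setup would load a pre-trained embedding and hand the resulting groups to a seeded topic model such as A-LDA, which is not implemented here.

```python
import numpy as np

def cluster_hashtags(vectors, k, n_iter=20):
    """Group hashtags by k-means on unit-normalised vectors.

    Assumption: clustering stands in for the paper's seed-topic extraction,
    which the abstract does not specify.
    """
    tags = sorted(vectors)
    X = np.array([vectors[t] for t in tags], dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # cosine geometry

    # Deterministic farthest-point initialisation (no random seed needed).
    centers = [X[0]]
    while len(centers) < k:
        best_sim = np.max(X @ np.array(centers).T, axis=1)
        centers.append(X[int(np.argmin(best_sim))])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assign each hashtag to the most similar centre.
        labels = np.argmax(X @ centers.T, axis=1)
        # Move each centre to the normalised mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                mean = members.mean(axis=0)
                centers[j] = mean / np.linalg.norm(mean)

    groups = {}
    for tag, label in zip(tags, labels):
        groups.setdefault(int(label), []).append(tag)
    return sorted(groups.values())

# Hypothetical toy embedding: axis 0 is roughly "sports", axis 1 "politics".
toy_embedding = {
    "#football": [1.0, 0.1],
    "#soccer": [0.9, 0.2],
    "#goal": [0.95, 0.0],
    "#election": [0.1, 1.0],
    "#vote": [0.0, 0.9],
    "#parliament": [0.2, 0.95],
}

seed_topics = cluster_hashtags(toy_embedding, k=2)
for topic in seed_topics:
    print(topic)
```

Each printed group is a candidate seed topic, that is, a set of hashtags that sit close together in the embedding space. In the workflow the abstract describes, such groups would serve as supervising information for the topic model rather than as its final output.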
Pages: 13