Dataless Short Text Classification Based on Biterm Topic Model and Word Embeddings

Cited by: 0
Authors
Yang, Yi [1, 3]
Wang, Hongan [1, 2, 3]
Zhu, Jiaqi [1, 2, 3]
Wu, Yunkun [3]
Jiang, Kailong [3]
Guo, Wenli [3]
Shi, Wandong [3]
Affiliations
[1] Chinese Acad Sci, Inst Software, SKLCS, Beijing, Peoples R China
[2] Zhejiang Lab, Hangzhou, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Key Research and Development Program
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Dataless text classification has attracted increasing attention recently. It needs only a few seed words per category to classify documents, which is much cheaper than supervised text classification that requires massive labeling effort. However, most existing models focus on long texts and perform unsatisfactorily on short texts, which have become increasingly popular on the Internet. In this paper, we first propose a novel model named Seeded Biterm Topic Model (SeedBTM), which extends BTM to solve the problem of dataless short text classification with seed words. It takes advantage of both word co-occurrence information in the topic model and category-word similarity from widely used word embeddings as the prior topic-in-set knowledge. Moreover, with the same approach, we also propose the Seeded Twitter Biterm Topic Model (SeedTBTM), which extends Twitter-BTM and utilizes additional user information to achieve higher classification accuracy. Experimental results on five real short-text datasets show that our models outperform the state-of-the-art methods and perform especially well when the categories are overlapping and interrelated.
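The two ingredients the abstract names, word co-occurrence (biterms) and category-word similarity from word embeddings, can be illustrated with a minimal sketch. This is not the SeedBTM inference procedure, which the abstract does not specify. The function names, the toy 2-d vectors, and the max-then-normalize rule for turning similarities into a prior are all assumptions made here for illustration.

```python
from itertools import combinations
from math import sqrt


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def category_word_prior(word, seeds_by_category, embeddings):
    """Hypothetical prior: the word's best seed similarity per category,
    clipped at zero and normalized to sum to 1 (uniform if no signal)."""
    categories = list(seeds_by_category)
    uniform = {c: 1.0 / len(categories) for c in categories}
    if word not in embeddings:
        return uniform
    scores = {}
    for category, seeds in seeds_by_category.items():
        sims = [cosine(embeddings[word], embeddings[s])
                for s in seeds if s in embeddings]
        scores[category] = max(0.0, max(sims, default=0.0))
    total = sum(scores.values())
    if total == 0:
        return uniform
    return {c: s / total for c, s in scores.items()}


# Toy 2-d vectors, invented for illustration only.
embeddings = {
    "goal": [1.0, 0.0],
    "football": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "market": [0.1, 0.9],
}
seeds_by_category = {"sports": ["football"], "business": ["market"]}

biterms = extract_biterms(["goal", "football", "stock"])
prior = category_word_prior("goal", seeds_by_category, embeddings)
```

With these toy vectors, "goal" receives a larger prior weight for "sports" than for "business". In practice the vectors would come from pretrained word embeddings rather than hand-written values.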
Pages: 3969 - 3975
Number of pages: 7
Related papers
50 in total
  • [41] A short text topic modeling method based on integrating Gaussian and Logistic coding networks with pre-trained word embeddings
    Zhang, Si
    Xu, Jiali
    Hui, Ning
    Zhai, Peiyun
    Neurocomputing, 2025, 616
  • [42] Probabilistic topic modeling for short text based on word embedding networks
    Pita, Marcelo
    Nunes, Matheus
    Pappa, Gisele L.
    APPLIED INTELLIGENCE, 2022, 52 (15) : 17829 - 17844
  • [43] Probabilistic topic modeling for short text based on word embedding networks
    Marcelo Pita
    Matheus Nunes
    Gisele L. Pappa
    Applied Intelligence, 2022, 52 : 17829 - 17844
  • [44] Topic Evolutionary Analysis of Short Text Based on Word Vector and BTM
    Zhang P.
    Liu D.
    Data Analysis and Knowledge Discovery, 2019, 3 (03) : 95 - 101
  • [45] Short Text Embedding for Clustering based on Word and Topic Semantic Information
    Chen, Ziheng
    Ren, Jiangtao
    2019 IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA 2019), 2019 : 61 - 70
  • [46] Word-class embeddings for multiclass text classification
    Alejandro Moreo
    Andrea Esuli
    Fabrizio Sebastiani
    Data Mining and Knowledge Discovery, 2021, 35 : 911 - 963
  • [47] Enhancing Topic Modeling for Short Texts with Auxiliary Word Embeddings
    Li, Chenliang
    Duan, Yu
    Wang, Haoran
    Zhang, Zhiqian
    Sun, Aixin
    Ma, Zongyang
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2017, 36 (02)
  • [48] Word-class embeddings for multiclass text classification
    Moreo, Alejandro
    Esuli, Andrea
    Sebastiani, Fabrizio
    DATA MINING AND KNOWLEDGE DISCOVERY, 2021, 35 (03) : 911 - 963
  • [49] An analysis of hierarchical text classification using word embeddings
    Stein, Roger Alan
    Jaques, Patricia A.
    Valiati, Joao Francisco
    INFORMATION SCIENCES, 2019, 471 : 216 - 232
  • [50] Topic extraction by clustering word embeddings on short online texts
    Nabergoj, David
    D’Alconzo, Alessandro
    Valerio, Danilo
    Štrumbelj, Erik
    Elektrotehniski Vestnik/Electrotechnical Review, 2022, 89 (1-2) : 64 - 72