Dataless Short Text Classification Based on Biterm Topic Model and Word Embeddings

Cited by: 0
Authors
Yang, Yi [1 ,3 ]
Wang, Hongan [1 ,2 ,3 ]
Zhu, Jiaqi [1 ,2 ,3 ]
Wu, Yunkun [3 ]
Jiang, Kailong [3 ]
Guo, Wenli [3 ]
Shi, Wandong [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Software, SKLCS, Beijing, Peoples R China
[2] Zhejiang Lab, Hangzhou, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial intelligence theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dataless text classification has attracted increasing attention recently. It needs only a few seed words per category to classify documents, which is much cheaper than supervised text classification with its massive labeling effort. However, most existing models focus on long texts and perform unsatisfactorily on short texts, which have become increasingly popular on the Internet. In this paper, we first propose a novel model named Seeded Biterm Topic Model (SeedBTM), which extends BTM to solve the problem of dataless short text classification with seed words. It takes advantage of both word co-occurrence information in the topic model and category-word similarity from widely used word embeddings as the prior topic-in-set knowledge. Moreover, with the same approach, we also propose the Seeded Twitter Biterm Topic Model (SeedTBTM), which extends Twitter-BTM and utilizes additional user information to achieve higher classification accuracy. Experimental results on five real short-text datasets show that our models outperform state-of-the-art methods, and perform especially well when the categories are overlapping and interrelated.
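The abstract's key ingredient, category-word similarity derived from word embeddings and used as prior topic-in-set knowledge, can be sketched as follows. The toy embeddings, seed words, and max-over-seeds aggregation below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

# Toy 4-d word vectors; in practice these would come from pretrained
# embeddings such as word2vec or GloVe (hypothetical values).
embeddings = {
    "game":   np.array([0.9, 0.1, 0.0, 0.2]),
    "team":   np.array([0.8, 0.2, 0.1, 0.1]),
    "stock":  np.array([0.1, 0.9, 0.3, 0.0]),
    "market": np.array([0.0, 0.8, 0.4, 0.1]),
}

# A few seed words per category, as in dataless classification
# (categories and seeds here are made up for illustration).
seeds = {"sports": ["game"], "finance": ["stock"]}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def category_word_similarity(word):
    """Score a word against each category as the maximum cosine
    similarity to that category's seed words (one plausible choice
    for building a seed-informed topic prior)."""
    return {cat: max(cosine(embeddings[word], embeddings[s]) for s in ws)
            for cat, ws in seeds.items()}

sims = category_word_similarity("team")
# "team" lies close to the "game" vector, so its sports score exceeds
# its finance score; such scores could bias topic-word assignments.
```

Similarity scores of this kind would then act as soft priors steering words (and hence biterms) toward the seed-defined topics during inference.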
Pages: 3969-3975
Page count: 7
Related Papers
(50 records in total)
  • [21] Arabic Text Classification Based on Word and Document Embeddings
    El Mahdaouy, Abdelkader
    Gaussier, Eric
    El Alaoui, Said Ouatik
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT SYSTEMS AND INFORMATICS 2016, 2017, 533 : 32 - 41
  • [22] Text Similarity Function Based on Word Embeddings for Short Text Analysis
    Pascual, Adrian Jimenez
    Fujita, Sumio
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2017), PT I, 2018, 10761 : 391 - 402
  • [23] Text Classification Using Word Embeddings
    Helaskar, Mukund N.
    Sonawane, Sheetal S.
    2019 5TH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION, CONTROL AND AUTOMATION (ICCUBEA), 2019,
  • [24] A Method of Short Text Representation Fusion with Weighted Word Embeddings and Extended Topic Information
    Liu, Wenfu
    Pang, Jianmin
    Du, Qiming
    Li, Nan
    Yang, Shudan
    SENSORS, 2022, 22 (03)
  • [25] Sequence-Based Word Embeddings for Effective Text Classification
    Gomes, Bruno Guilherme
    Murai, Fabricio
    Goussevskaia, Olga
    Couto da Silva, Ana Paula
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2021), 2021, 12801 : 135 - 146
  • [26] User Based Aggregation for Biterm Topic Model
    Chen, Weizheng
    Wang, Jinpeng
    Zhang, Yan
    Yan, Hongfei
    Li, Xiaoming
    PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL) AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (IJCNLP), VOL 2, 2015, : 489 - 494
  • [27] A clustering-based topic model using word networks and word embeddings
    Mu, Wenchuan
    Lim, Kwan Hui
    Liu, Junhua
    Karunasekera, Shanika
    Falzon, Lucia
    Harwood, Aaron
    JOURNAL OF BIG DATA, 2022, 9 (01)
  • [29] Text classification with semantically enriched word embeddings
    Pittaras, N.
    Giannakopoulos, G.
    Papadakis, G.
    Karkaletsis, V.
    NATURAL LANGUAGE ENGINEERING, 2021, 27 (04) : 391 - 425
  • [30] A Robust User Sentiment Biterm Topic Mixture Model Based on User Aggregation Strategy to Avoid Data Sparsity for Short Text
    Nimala, K.
    Jebakumar, R.
    Journal of Medical Systems, 2019, 43 (4)