A Pseudo-document-based Topical N-grams model for short texts

Cited by: 5
Authors
Lin, Hao [1]
Zuo, Yuan [1]
Liu, Guannan [1]
Li, Hong [1]
Wu, Junjie [1, 2, 3]
Wu, Zhiang [4]
Affiliations
[1] Beihang Univ, Sch Econ & Management, Beijing 100191, Peoples R China
[2] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Beijing 100191, Peoples R China
[3] Beihang Univ, Beijing Key Lab Emergency Support Simulat Technol, Beijing 100191, Peoples R China
[4] Nanjing Univ Finance & Econ, Jiangsu Prov Key Lab E Business, Nanjing, Peoples R China
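Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;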
Keywords
Short text; Topic model; Word order; Topical N-Grams; PHRASE;
DOI
10.1007/s11280-020-00814-x
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In recent years, short text topic modeling has drawn considerable attention from interdisciplinary researchers. Various customized topic models have been proposed to tackle the semantic sparseness of short texts. Most (if not all) of them follow the bag-of-words assumption, which, however, is not adequate, since word order and phrases are often critical to capturing the meaning of texts. On the other hand, while some existing topic models are sensitive to word order, they do not perform well on short texts due to the severe data sparseness. To address these issues, we propose the Pseudo-document-based Topical N-Grams model (PTNG), which alleviates the data sparsity problem of short texts while remaining sensitive to word order. Extensive experiments on three real-world data sets with state-of-the-art baselines demonstrate the high quality of the topics learned by PTNG according to UCI coherence scores, and a more discriminative semantic representation of short texts according to classification results.
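The limitation of the bag-of-words assumption noted in the abstract can be seen in a small sketch. This is not the PTNG model or any part of the authors' method; it is only an illustration, with hypothetical helper names (`bag_of_words`, `bigrams`) and a made-up pair of short texts, of how two texts with different meanings can share one bag-of-words representation while their word bigrams differ.

```python
from collections import Counter


def bag_of_words(text):
    """Count tokens, ignoring their order (hypothetical helper)."""
    return Counter(text.lower().split())


def bigrams(text):
    """Count adjacent token pairs, which preserves local word order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


# Two illustrative short texts built from the same words in a different order.
a = "dog bites man"
b = "man bites dog"

# The unordered counts cannot tell the two texts apart ...
same_bow = bag_of_words(a) == bag_of_words(b)
# ... while the bigram counts can.
same_bigrams = bigrams(a) == bigrams(b)

print(same_bow, same_bigrams)
```

Whitespace tokenization is assumed here for brevity; a real short-text pipeline would typically use a proper tokenizer and stop-word handling.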
Pages: 3001-3023
Number of pages: 23
Related papers
50 in total
  • [1] A Pseudo-document-based Topical N-grams model for short texts
    Hao Lin
    Yuan Zuo
    Guannan Liu
    Hong Li
    Junjie Wu
    Zhiang Wu
    [J]. World Wide Web, 2020, 23: 3001-3023
  • [2] Feature Extension for Chinese Short Text Classification Based on Topical N-Grams
    Sun, Baoshan
    Zhao, Peng
    [J]. 2017 16TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS 2017), 2017: 477-482
  • [3] Enhancing CRF Model with N-grams for Arabic Texts Chunking
    Khoufi, Nabil
    Aloulou, Chafik
    Belguith, Lamia Hadrich
    [J]. INNOVATION VISION 2020: FROM REGIONAL DEVELOPMENT SUSTAINABILITY TO GLOBAL ECONOMIC GROWTH, VOL I-VI, 2015: 2877-2884
  • [4] Diacritics restoration based on word n-grams for Slovak texts
    Toth, Stefan
    Zaymus, Emanuel
    Duracik, Michal
    Hrkut, Patrik
    Mesko, Matej
    [J]. OPEN COMPUTER SCIENCE, 2021, 11(01): 180-189
  • [5] Language Identification in Multilingual, Short and Noisy Texts using Common N-Grams
    Kosmajac, Dijana
    Keselj, Vlado
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017: 2752-2759
  • [6] Interpolated N-Grams for Model Based Testing
    Tonella, Paolo
    Tiella, Roberto
    Cu Duy Nguyen
    [J]. 36TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2014), 2014: 562-572
  • [7] Automatic restoration of diacritics based on word n-grams for Slovak texts
    Toth, Stefan
    Zaymus, Emanuel
    Duracik, Michal
    Mesko, Matej
    Hrkut, Patrik
    [J]. 2019 IEEE 15TH INTERNATIONAL SCIENTIFIC CONFERENCE ON INFORMATICS (INFORMATICS 2019), 2019: 243-248
  • [8] Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
    Lecluze, Charlotte
    Rigouste, Lois
    Giguet, Emmanuel
    Lucas, Nadine
    [J]. CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95: 473-481
  • [9] Authorship Identification of the Azerbaijani Texts Using n-grams
    Aida-zade, K. R.
    Talibov, S. Q.
    [J]. 2016 IEEE 10TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2016: 210-212
  • [10] An improved N-grams based Model for Authorship Attribution
    Boughaci, Dalila
    Benmesbah, Mounir
    Zebiri, Aniss
    [J]. 2019 INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCES (ICCIS), 2019: 70-75