Short text classification based on strong feature thesaurus

Cited by: 31
Authors
Wang, Bing-kun [1, 2]
Huang, Yong-feng [1, 2]
Yang, Wan-xia [1, 2]
Li, Xing [1, 2]
Affiliations
[1] Tsinghua Univ, Dept Elect & Engn, Informat Cognit & Intelligent Syst Res Inst, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Informat Technol Natl Lab, Beijing 100084, Peoples R China
Keywords
Short text; Classification; Data sparseness; Semantic; Strong feature thesaurus (SFT); Latent Dirichlet allocation (LDA);
DOI
10.1631/jzus.C1100373
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of the low accuracy in the classification of short texts using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers failed to notice that ignoring the semantic importance of certain feature terms might also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to feature terms in SFT, the classification accuracy can be improved. Specifically, our method appeared to be more effective with more detailed classification. Experiments on two short text datasets demonstrate that our approach achieved improvement compared with the state-of-the-art methods including support vector machine (SVM) and Naïve Bayes Multinomial.
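The abstract's core idea (rank terms by information gain, collect the strongest into a thesaurus, then give those terms larger weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the LDA step is omitted entirely, and the function names, the `top_k` cutoff, and the `boost` factor are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' w.r.t. the label."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_all = entropy(Counter(labels).values())
    h_cond = 0.0
    for subset in (with_term, without_term):
        if subset:  # skip empty partitions
            h_cond += len(subset) / len(labels) * entropy(Counter(subset).values())
    return h_all - h_cond

def build_thesaurus(docs, labels, top_k):
    """Hypothetical stand-in for the SFT: the top_k terms by IG (no LDA here)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:top_k])

def weighted_counts(doc, thesaurus, boost=2.0):
    """Term counts with thesaurus terms multiplied by an assumed boost factor."""
    counts = Counter(doc)
    return {t: c * (boost if t in thesaurus else 1.0) for t, c in counts.items()}

# Toy data: each document is a list of tokens.
docs = [["goal", "match"], ["goal", "team"],
        ["stock", "market"], ["stock", "team"]]
labels = ["sports", "sports", "finance", "finance"]

sft = build_thesaurus(docs, labels, top_k=2)
features = weighted_counts(["goal", "team", "goal"], sft)
```

On this toy data, "goal" and "stock" each split the classes perfectly, so they form the thesaurus, while "team" occurs in both classes and carries no information gain. The boosted counts could then be fed to any classifier that accepts weighted term features.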
Pages: 649-659
Page count: 11
Related papers
50 in total
  • [21] Feature Extension for Chinese Short Text Classification Based on LDA and Word2vec
    Sun, Fanke
    Chen, Heping
    [J]. PROCEEDINGS OF THE 2018 13TH IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA 2018), 2018, : 1189 - 1194
  • [22] Improving Persian Text Classification Using Persian Thesaurus
    Parvin, Hamid
    Minaei-Bidgoli, Behrouz
    Dahbashi, Atousa
    [J]. PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, 2011, 7042 : 391 - 398
  • [23] Improving Short Text Classification through Better Feature Space Selection
    Wang, Meng
    Lin, Lanfen
    Wang, Feng
    [J]. 2013 9TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), 2013, : 120 - 124
  • [24] Short Text Classification Based on Keywords Extension
    Gu, Yiran
    Shen, Jiajia
    [J]. 2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 2616 - 2621
  • [25] Wikipedia Based Short Text Classification Method
    Li, Junze
    Cai, Yi
    Cai, Zhiwei
    Leung, Hofung
    Yang, Kai
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2017), 2017, 10179 : 275 - 286
  • [26] Text Relatedness Based on a Word Thesaurus
    Tsatsaronis, George
    Varlamis, Iraklis
    Vazirgiannis, Michalis
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2010, 37 : 1 - 39
  • [27] Utility-based feature selection for text classification
    Wang, Heyong
    Hong, Ming
    Lau, Raymond Yiu Keung
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2019, 61 : 197 - 226
  • [28] A Kernel-based Feature Weighting for Text Classification
    Wittek, Peter
    Tan, Chew Lim
    [J]. IJCNN: 2009 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1- 6, 2009, : 3062 - 3068
  • [29] Utility-based feature selection for text classification
    Wang, Heyong
    Hong, Ming
    Lau, Raymond Yiu Keung
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2019, 61 (01) : 197 - 226
  • [30] Text Classification via iVector Based Feature Representation
    Zha, Shengxin
    Peng, Xujun
    Cao, Huaigu
    Zhuang, Xiaodan
    Natarajan, Pradeep
    Natarajan, Prem
    [J]. 2014 11TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2014), 2014, : 151 - 155