Short text classification based on strong feature thesaurus

Cited by: 31
Authors
Wang, Bing-kun [1 ,2 ]
Huang, Yong-feng [1 ,2 ]
Yang, Wan-xia [1 ,2 ]
Li, Xing [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Dept Elect & Engn, Informat Cognit & Intelligent Syst Res Inst, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Informat Technol Natl Lab, Beijing 100084, Peoples R China
Keywords
Short text; Classification; Data sparseness; Semantic; Strong feature thesaurus (SFT); Latent Dirichlet allocation (LDA);
DOI
10.1631/jzus.C1100373
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of the low accuracy in the classification of short texts using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers failed to notice that ignoring the semantic importance of certain feature terms might also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to feature terms in SFT, the classification accuracy can be improved. Specifically, our method appeared to be more effective with more detailed classification. Experiments in two short text datasets demonstrate that our approach achieved improvement compared with the state-of-the-art methods including support vector machine (SVM) and Naïve Bayes Multinomial.
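The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it scores terms by information gain only (the paper's SFT also draws on LDA, which is omitted here), and the toy corpus, the top-2 cutoff, the `boost` factor of 2.0, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of a term's presence/absence with respect to the labels."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    conditional = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term)
        if part  # skip an empty split to avoid dividing by zero
    )
    return entropy(labels) - conditional


def boosted_vector(doc, strong_terms, boost=2.0):
    """Term-frequency vector in which strong-feature terms are scaled by `boost`."""
    counts = Counter(doc)
    return {t: c * (boost if t in strong_terms else 1.0) for t, c in counts.items()}


# Hypothetical toy corpus of tokenized short texts (not from the paper).
docs = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "news"],
    ["stock", "market", "news"],
    ["stock", "price", "deal"],
]
labels = ["travel", "travel", "finance", "finance"]

vocab = sorted({t for d in docs for t in d})
ig = {t: information_gain(docs, labels, t) for t in vocab}

# Assumed selection rule: keep the two highest-IG terms as the "strong" set.
ranked = sorted(vocab, key=lambda t: (-ig[t], t))
strong_terms = set(ranked[:2])

print(strong_terms)
print(boosted_vector(docs[0], strong_terms))
```

In this toy data, "flight" and "stock" each separate the two classes perfectly, so they receive the highest information gain and are the terms whose counts get scaled. The resulting weighted vectors could then be passed to any standard classifier.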
Pages: 649-659
Page count: 11
Related papers
  • [31] Tibetan Text Classification Based on the Feature of Position Weight
    Cao, Hui
    Jia, Huiqiang
    [J]. 2013 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2013), 2013, : 220 - 223
  • [32] A THESAURUS-GUIDED TEXT ANALYTICS TECHNIQUE FOR CAPABILITY BASED CLASSIFICATION OF MANUFACTURING SUPPLIERS
    Sabbagh, Ramin
    Ameri, Farhad
    [J]. PROCEEDINGS OF THE ASME INTERNATIONAL DESIGN ENGINEERING TECHNICAL CONFERENCES AND COMPUTERS AND INFORMATION IN ENGINEERING CONFERENCE, 2017, VOL 1, 2017,
  • [33] Text classification based on feature selection and LDA model
    [J]. Zheng, C. (csahu@126.com), 1600, Binary Information Press, P.O. Box 162, Bethel, CT 06801-0162, United States (09):
  • [34] Feature-Based Subjectivity Classification of Filipino Text
    Regalado, Ralph Vincent J.
    Cheng, Charibeth K.
    [J]. 2012 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2012), 2012, : 57 - 60
  • [35] Lexicon based feature extraction for emotion text classification
    Bandhakavi, Anil
    Wiratunga, Nirmalie
    Padmanabhan, Deepak
    Massie, Stewart
    [J]. PATTERN RECOGNITION LETTERS, 2017, 93 : 133 - 142
  • [36] A text classification algorithm based on feature library projection
    Yin S.
    Zheng H.
    Xu S.
    Rong H.
    Zhang N.
    [J]. Zhongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Central South University (Science and Technology), 2017, 48 (07): : 1782 - 1789
  • [37] Feature Word Vector Based on Short Text Clustering
    Liu, Xin
    Wang, Bo
    Xi, Yao-yi
    Mao, Er-song
    Ke, Sheng-cai
    Tang, Yong-wang
    [J]. COMPUTER SCIENCE AND TECHNOLOGY (CST2016), 2017, : 533 - 545
  • [38] Association Rules Based Short Text Feature Extension
    Huang Wei
    Li Shan-Fei
    Tan Yue-Jin
    Gao Bing
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2009, 9 (10): : 227 - 230
  • [39] Improving Persian Text Classification and Clustering Using Persian Thesaurus
    Parvin, Hamid
    Dahbashi, Atousa
    Parvin, Sajad
    Minaei-Bidgoli, Behrouz
    [J]. DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE, 2012, 151 : 493 - 500
  • [40] Text classification framework for short text based on TFIDF-FastText
    Shrutika Chawla
    Ravreet Kaur
    Preeti Aggarwal
    [J]. Multimedia Tools and Applications, 2023, 82 : 40167 - 40180