Feature Extension for Chinese Short Text Classification Based on Topical N-Grams

Authors
Sun, Baoshan [1]
Zhao, Peng [1]
Affiliation
[1] Tianjin Polytech Univ, Sch Comp Sci & Software Engn, Tianjin, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Topical N-Grams; LDA; Short Texts Classification; Feature Extension; SVM;
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Because of feature sparseness, conventional text classification methods rarely perform well on short texts. This paper presents a novel feature extension method based on the Topical N-Grams (TNG) model to address this problem. The TNG algorithm infers not only the distribution of unigram words but also the distribution of phrases for each topic. We build a feature extension library using the TNG algorithm. Based on the original features of each short text, we compute its topic tendency. According to the topic tendency, appropriate candidate words and phrases are selected from the feature extension library and added to the original short text. After extending the features, we use the LDA and SVM algorithms to classify the expanded short texts, and we use precision, recall, and F1-score to evaluate classification performance. The results show that our method can significantly improve classification performance.
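The extension step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the library contents, the scoring rule (summed term probabilities, normalized across topics), and the names `LIBRARY`, `topic_tendency`, `extend_features`, `top_k`, and `min_tendency` are all assumptions made for illustration. A real library would come from a trained TNG model.

```python
# Hypothetical feature extension library: topic id -> {term: P(term | topic)}.
# Terms may be single words or phrases; every value here is made up.
LIBRARY = {
    0: {"stock": 0.30, "market": 0.25, "stock market": 0.20, "interest rate": 0.15},
    1: {"match": 0.30, "team": 0.25, "world cup": 0.20, "home team": 0.15},
}


def topic_tendency(tokens, library):
    """Score each topic by the summed probability of the text's tokens,
    then normalize so the scores add up to 1 (assumed scoring rule)."""
    scores = {
        topic: sum(terms.get(tok, 0.0) for tok in tokens)
        for topic, terms in library.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        # No token is known to any topic: no tendency can be inferred.
        return {topic: 0.0 for topic in library}
    return {topic: score / total for topic, score in scores.items()}


def extend_features(tokens, library, top_k=2, min_tendency=0.5):
    """Append the top_k most probable unseen terms of the dominant topic."""
    tendency = topic_tendency(tokens, library)
    best = max(tendency, key=tendency.get)
    if tendency[best] < min_tendency:
        # Tendency too weak: leave the short text unchanged.
        return list(tokens)
    present = set(tokens)
    candidates = sorted(
        ((term, prob) for term, prob in library[best].items() if term not in present),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return list(tokens) + [term for term, _ in candidates[:top_k]]


extended = extend_features(["stock", "rises"], LIBRARY)
print(extended)
```

In this sketch a text is only extended when one topic clearly dominates, so that texts with no recognizable terms are passed to the classifier unchanged.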
Pages: 477-482
Page count: 6