Feature Extension for Chinese Short Text Classification Based on Topical N-Grams

Cited by: 0
Authors
Sun, Baoshan [1]
Zhao, Peng [1]
Affiliations
[1] Tianjin Polytech Univ, Sch Comp Sci & Software Engn, Tianjin, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Topical N-Grams; LDA; Short Texts Classification; Feature Extension; SVM;
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Because of feature sparseness, conventional text classification methods rarely perform well on short texts. This paper presents a novel feature extension method based on the Topical N-Grams (TNG) model to address this problem. The TNG model infers not only the distribution of unigram words but also the distribution of phrases for each topic. We build a feature extension library using the TNG algorithm. Based on the original features of each short text, we compute its topic tendency. According to the topic tendency, appropriate candidate words and phrases are selected from the feature extension library and added to the original short text. After extending the features, we use the LDA and SVM algorithms to classify the expanded short texts, and use precision, recall, and F1-score to evaluate classification performance. The results show that our method can significantly improve classification performance.
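
The feature extension step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the library contents, the scoring rule for topic tendency, and the names `EXTENSION_LIBRARY`, `topic_tendency`, `extend_features`, `top_k`, and `threshold` are all assumptions made for illustration. In the paper the library would come from a trained TNG model rather than hand-written values.

```python
# Hypothetical feature extension library: topic id -> {word or phrase: probability}.
# The values are invented for illustration; a real library would be built from TNG output.
EXTENSION_LIBRARY = {
    0: {"stock": 0.30, "market": 0.25, "stock market": 0.20, "interest rate": 0.15},
    1: {"match": 0.30, "team": 0.25, "world cup": 0.20, "head coach": 0.15},
}


def topic_tendency(tokens, library):
    """Score each topic by summing the library probabilities of the text's
    tokens, then normalize the scores so they sum to 1 (assumed scoring rule)."""
    scores = {
        topic: sum(terms.get(tok, 0.0) for tok in tokens)
        for topic, terms in library.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No token matched any topic: no tendency can be computed.
        return {topic: 0.0 for topic in library}
    return {topic: score / total for topic, score in scores.items()}


def extend_features(tokens, library, top_k=2, threshold=0.5):
    """Append up to top_k of the most probable words or phrases from every
    topic whose tendency reaches the threshold."""
    tendency = topic_tendency(tokens, library)
    extended = list(tokens)
    for topic, weight in tendency.items():
        if weight < threshold:
            continue
        # Candidates ordered from most to least probable within the topic.
        candidates = sorted(library[topic].items(), key=lambda kv: kv[1], reverse=True)
        added = 0
        for term, _prob in candidates:
            if added >= top_k:
                break
            if term not in extended:
                extended.append(term)
                added += 1
    return extended


short_text = ["team", "wins"]
print(topic_tendency(short_text, EXTENSION_LIBRARY))
print(extend_features(short_text, EXTENSION_LIBRARY))
```

The expanded token list would then be passed to whatever classifier is used downstream (LDA features and an SVM in the paper); that stage is omitted here.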
Pages: 477-482
Page count: 6
Related papers
50 in total
  • [31] Better text compression from fewer lexical n-grams
    Smith, TC
    Lorenz, M
    DCC 2001: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2001, : 516 - 516
  • [32] Towards an automatic classification of images: Approach by the n-grams
    Laouamer, Lamri
    Biskri, Ismail
    Houmadi, Benamar
    WMSCI 2005: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Vol 3, 2005, : 73 - 78
  • [33] Composer classification using melodic combinatorial n-grams
    Alvarez, Daniel Alejandro Perez
    Gelbukh, Alexander
    Sidorov, Grigori
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 249
  • [34] CONTINUOUS MODELS OF AFFECT FROM TEXT USING N-GRAMS
    Malandrakis, Nikolaos
    Potamianos, Alexandros
    Narayanan, Shrikanth
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 8500 - 8504
  • [35] Probabilistic retrieval of OCR degraded text using N-grams
    Harding, SM
    Croft, WB
    Weir, C
    RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, 1997, 1324 : 345 - 359
  • [36] UNORDERED N-GRAMS: NEW APPROACH IN TEXT PLAGIARISM DETECTION
    Pribil, Jiri
    Leseticky, Ondrej
    Kubalova, Kamila
    INFORMATION TECHNOLOGIES' 2009, 2009, : 243 - 249
  • [37] Error Classification Using Automatic Measures Based on n-grams and Edit Distance
    Benko, L'ubomir
    Benkova, Lucia
    Munkova, Dasa
    Munk, Michal
    Shulzenko, Danylo
    ADVANCED RESEARCH IN TECHNOLOGIES, INFORMATION, INNOVATION AND SUSTAINABILITY, ARTIIS 2022, PT I, 2022, 1675 : 345 - 356
  • [38] A CNN based approach to Phrase-Labelling through classification of N-Grams
    Choudhary, Chinmay
    O'Riordan, Colm
    PROCEEDINGS OF THE 11TH ANNUAL MEETING OF THE FORUM FOR INFORMATION RETRIEVAL EVALUATION (FIRE 2019), 2019, : 18 - 23
  • [39] Topical n-grams: Phrase and topic discovery, with an application to information retrieval
    Wang, Xuerui
    McCallum, Andrew
    Wei, Xing
    ICDM 2007: PROCEEDINGS OF THE SEVENTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, 2007, : 697 - 702
  • [40] Automatic statistical translation based on n-grams
    Oliver, Antonio
    Badia, Toni
    Boleda, Gemma
    Melero, Maite
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (35): : 77 - 84