N-grams based feature selection and text representation for Chinese Text Classification

Cited by: 0
Authors
Zhihua Wei
Duoqian Miao
Jean-Hugues Chauchat
Rui Zhao
Wen Li
Affiliations
[1] Tongji University, Department of Computer Science and Engineering
[2] Tongji University, Key Laboratory "Embedded System and Service Computing", Ministry of Education
[3] Université de Lyon
[4] Laboratoire ERIC-Lyon2
Keywords
Chinese text classification; n-gram; feature selection; text representation weight
DOI
10.2991/ijcis.2009.2.4.5
Abstract
This paper discusses text representation and feature selection strategies for Chinese text classification based on n-grams. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four feature selection methods and three text representation weights are compared through exhaustive experiments. Both a C-SVC classifier and a Naive Bayes classifier are used to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which contains more than 14,000 texts divided into 12 classes. The experiments address: (1) a performance comparison among the feature selection strategies: absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency; (2) a comparison of the sparseness and feature correlation in the "text by feature" matrices produced by the four feature selection methods; (3) a performance comparison among three term weights: 0/1 logical value, n-gram frequency numeric value (TF) and TF*IDF value.
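The three term weights named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bigram setting, the toy texts, the reading of "absolute text frequency" as a minimum number of texts containing an n-gram, and the idf form log(N / df) are all assumptions made for illustration only.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def select_by_text_frequency(docs_ngrams, min_docs=2):
    """Keep n-grams that occur in at least `min_docs` texts.

    Assumption: this threshold stands in for 'absolute text frequency';
    the paper's exact criterion is not given in the abstract.
    """
    df = Counter()
    for grams in docs_ngrams:
        df.update(set(grams))  # count each n-gram once per text
    vocab = sorted(g for g, c in df.items() if c >= min_docs)
    return vocab, df

def represent(grams, vocab, df, n_docs, weight="tfidf"):
    """Build one row of the 'text by feature' matrix with the chosen weight."""
    tf = Counter(grams)
    row = []
    for g in vocab:
        if weight == "binary":    # 0/1 logical value
            row.append(1.0 if tf[g] > 0 else 0.0)
        elif weight == "tf":      # raw n-gram frequency
            row.append(float(tf[g]))
        elif weight == "tfidf":   # assumed idf = log(N / df)
            row.append(tf[g] * math.log(n_docs / df[g]))
        else:
            raise ValueError(f"unknown weight: {weight}")
    return row

# Toy texts (hypothetical, not taken from TanCorpV1.0).
docs = ["中国足球队", "中国篮球队", "股票市场"]
docs_ngrams = [char_ngrams(d, 2) for d in docs]
vocab, df = select_by_text_frequency(docs_ngrams, min_docs=2)
rows = [represent(g, vocab, df, len(docs), "tfidf") for g in docs_ngrams]
```

Character n-grams avoid the need for a Chinese word segmenter, which is why the selection step matters: without a frequency threshold the feature space grows quickly and the resulting matrix becomes very sparse.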
Pages: 365-374
Number of pages: 9