Chi-square Statistics Feature Selection Based on Term Frequency and Distribution for Text Categorization

Cited by: 34
Authors
Jin, Chuanxin [1 ]
Ma, Tinghuai [1 ,2 ]
Hou, Rongtao [1 ]
Tang, Meili [3 ]
Tian, Yuan [4 ]
Al-Dhelaan, Abdullah [4 ]
Al-Rodhaan, Mznah [4 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Comp & Software, Nanjing, Jiangsu, Peoples R China
[2] Nanjing Univ Informat Sci & Technol, Jiangsu Engn Ctr Network Monitoring, Nanjing, Jiangsu, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Sch Publ Adm, Nanjing, Jiangsu, Peoples R China
[4] King Saud Univ, Comp Sci Dept, Riyadh, Saudi Arabia
Funding
National Science Foundation (USA); China Postdoctoral Science Foundation;
Keywords
Term frequency; Chi-square statistics; Text categorization; Feature selection; Difference degree of distribution; INFORMATION; CLASSIFIER; CATEGORY;
DOI
10.1080/03772063.2015.1021385
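Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline classification code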
0808 ; 0809 ;
Abstract
Text categorization (TC) has become a key technology for finding relevant and timely information in large volumes of digital documents, and feature selection techniques have been proposed to overcome the high dimensionality that causes high computational complexity and low accuracy in TC tasks. Chi-square statistics (CHI) is one of the most efficient feature selection methods; however, it has two weaknesses. (1) It is document-frequency based and only counts whether a term occurs or not. In practice, a high-frequency term occurring in few documents is often regarded as a discriminator in the corpus. (2) It does not consider the term distribution. A term has more discriminating power for a specific category when its difference in degree of distribution is lower. In this paper, we propose a modified CHI feature selection approach, called term frequency and distribution based CHI, to overcome these weaknesses. We use sample variance to calculate the term distribution, and improve the classic CHI with maximum term frequency. Extensive comparative experiments on three corpora show that the proposed approach is comparable to the classic feature selection methods in terms of macro-F1 and micro-F1.
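As a rough illustration of weakness (1), the following minimal sketch computes the classic document-frequency-based CHI score from the standard 2x2 term/category contingency table. It is not the modified method proposed in the paper, whose exact formula is not given in the abstract. The function name `chi_square`, the toy corpus, and the labels are hypothetical and exist only for this example.

```python
from typing import List


def chi_square(docs: List[List[str]], labels: List[str],
               term: str, category: str) -> float:
    """Classic CHI score of `term` for `category` (illustrative sketch).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens      # presence only; repeats are ignored
        in_cat = label == category
        if present and in_cat:
            a += 1
        elif present:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0                    # degenerate table: no evidence either way
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus: "goal" is repeated three times in the first document.
docs = [
    ["goal", "goal", "goal", "match"],
    ["goal", "team"],
    ["vote", "party"],
    ["vote", "goal"],
]
labels = ["sport", "sport", "politics", "politics"]

score_with_repeats = chi_square(docs, labels, "goal", "sport")

# Collapsing repeated terms leaves the score unchanged, because classic CHI
# only looks at whether a term occurs in a document, not how often.
docs_dedup = [sorted(set(tokens)) for tokens in docs]
score_without_repeats = chi_square(docs_dedup, labels, "goal", "sport")
```

Both scores are identical here, which is the behaviour the abstract criticizes: term frequency within a document carries no weight in the classic statistic.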
Pages: 351-362
Page count: 12
Related papers
50 in total
  • [1] A Chi-square Statistics Based Feature Selection Method in Text Classification
    Zhai, Yujia
    Song, Wei
    Liu, Xianjun
    Liu, Lizhen
    Zhao, Xinlei
    [J]. PROCEEDINGS OF 2018 IEEE 9TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS), 2018, : 160 - 163
  • [2] Using chi-square statistics to measure similarities for text categorization
    Chen, Yao-Tsung
    Chen, Meng Chang
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (04) : 3085 - 3090
  • [3] An Improved Native Bayes Classifier for Imbalanced Text Categorization Based on K-means and CHI-square Feature Selection
    Meng Fanbo
    Xu Linying
    [J]. 2018 EIGHTH INTERNATIONAL CONFERENCE ON INSTRUMENTATION AND MEASUREMENT, COMPUTER, COMMUNICATION AND CONTROL (IMCCC 2018), 2018, : 894 - 898
  • [4] Feature selection using an improved Chi-square for Arabic text classification
    Bahassine, Said
    Madani, Abdellah
    Al-Sarem, Mohammed
    Kissi, Mohamed
    [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2020, 32 (02) : 225 - 231
  • [5] A FAST CHI-SQUARE BASED ALGORITHM FOR TEXT CATEGORIZATION OF MEDLINE CITATIONS
    Kastrin, Andrej
    Peterlin, Borut
    Hristovski, Dimitar
    [J]. IUBMB LIFE, 2009, 61 (03) : 326 - 326
  • [6] Expansions for the distribution of asymptotically chi-square statistics
    Withers, Christopher S.
    Nadarajah, Saralees
    [J]. STATISTICAL METHODOLOGY, 2013, 12 : 16 - 30
  • [7] Relative term-frequency based feature selection for text categorization
    Yang, SM
    Wu, XB
    Deng, ZH
    Zhang, M
    Yang, DQ
    [J]. 2002 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-4, PROCEEDINGS, 2002, : 1432 - 1436
  • [8] Properties of chi-square statistic and information gain for feature selection of imbalanced text data
    Mun, Hye In
    Son, Won
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2022, 35 (04) : 469 - 484
  • [9] Comparison of term frequency and document frequency based feature selection metrics in text categorization
    Azam, Nouman
    Yao, JingTao
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2012, 39 (05) : 4760 - 4768
  • [10] Chi-square classifier for document categorization
    Alexandrov, M
    Gelbukh, A
    Lozovoi, G
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2001, 2004 : 457 - 459