Using chi-square statistics to measure similarities for text categorization

Cited by: 69
Authors
Chen, Yao-Tsung [1]
Chen, Meng Chang [2]
Affiliations
[1] Natl Penghu Univ Sci & Technol, Dept Comp Sci & Informat Engn, Makung City 880, Penghu, Taiwan
[2] Acad Sinica, Inst Informat Sci, Taipei, Taiwan
Keywords
Nonparametric statistics; Text mining; Machine learning
DOI
10.1016/j.eswa.2010.08.100
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. The properties of chi-square tests for text categorization are studied first. One advantage of the chi-square test is that its significance level is similar to the miss rate, which provides a foundation for a theoretical performance (i.e., miss rate) guarantee. A classifier using cosine similarities with TF*IDF generally performs reasonably well in text categorization. However, its performance may fluctuate even near the optimal threshold value. To overcome this limitation, we propose the combined usage of chi-square statistics and cosine similarities. Extensive experimental results verify the properties of chi-square tests and the performance of the combined usage. © 2010 Elsevier Ltd. All rights reserved.
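The abstract does not give the paper's exact formulation, so the following is only a minimal sketch of the general idea, assuming the standard Pearson chi-square statistic for homogeneity on a 2 x k table built from two raw term-count vectors, alongside plain cosine similarity. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def chi_square_homogeneity(a, b):
    """Pearson chi-square statistic for the 2 x k table formed by two
    term-count vectors (assumed formulation, not the paper's own).

    Returns (statistic, degrees_of_freedom). A small statistic suggests
    the two samples share one term distribution.
    """
    if len(a) != len(b):
        raise ValueError("term vectors must have the same length")
    n_a, n_b = sum(a), sum(b)
    if n_a == 0 or n_b == 0:
        raise ValueError("each term vector needs at least one count")
    total = n_a + n_b
    stat = 0.0
    used_terms = 0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # term absent from both samples: no information
        used_terms += 1
        expected_a = n_a * col / total  # expected count under homogeneity
        expected_b = n_b * col / total
        stat += (x - expected_a) ** 2 / expected_a
        stat += (y - expected_b) ** 2 / expected_b
    return stat, max(used_terms - 1, 0)


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy term counts for a document and a category profile.
doc = [10, 20, 30]
category = [20, 40, 60]
stat, dof = chi_square_homogeneity(doc, category)
cos = cosine_similarity(doc, category)
```

In this sketch, a decision rule would compare the statistic with a chi-square critical value for the chosen significance level and degrees of freedom (for instance via `scipy.stats.chi2.sf`), while the cosine score would be compared with a separate threshold. How the paper actually combines the two measures is not stated in the abstract.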
Pages: 3085-3090
Page count: 6
Related papers
50 in total
  • [1] Chi-square Statistics Feature Selection Based on Term Frequency and Distribution for Text Categorization
    Jin, Chuanxin
    Ma, Tinghuai
    Hou, Rongtao
    Tang, Meili
    Tian, Yuan
    Al-Dhelaan, Abdullah
    Al-Rodhaan, Mznah
    [J]. IETE JOURNAL OF RESEARCH, 2015, 61 (04) : 351 - 362
  • [2] A FAST CHI-SQUARE BASED ALGORITHM FOR TEXT CATEGORIZATION OF MEDLINE CITATIONS
    Kastrin, Andrej
    Peterlin, Borut
    Hristovski, Dimitar
    [J]. IUBMB LIFE, 2009, 61 (03) : 326 - 326
  • [3] Chi-square classifier for document categorization
    Alexandrov, M
    Gelbukh, A
    Lozovoi, G
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2001, 2004 : 457 - 459
  • [4] A Chi-square Statistics Based Feature Selection Method in Text Classification
    Zhai, Yujia
    Song, Wei
    Liu, Xianjun
    Liu, Lizhen
    Zhao, Xinlei
    [J]. PROCEEDINGS OF 2018 IEEE 9TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS), 2018, : 160 - 163
  • [5] ON MAXIMALLY SELECTED CHI-SQUARE STATISTICS
    KOZIOL, JA
    [J]. BIOMETRICS, 1991, 47 (04) : 1557 - 1561
  • [6] MAXIMALLY SELECTED CHI-SQUARE STATISTICS
    MILLER, R
    SIEGMUND, D
    [J]. BIOMETRICS, 1982, 38 (04) : 1011 - 1016
  • [7] CHI-SQUARE NOT A MEASURE OF DEGREE OF RELATIONSHIP
    VANCE, FL
    [J]. PERSONNEL AND GUIDANCE JOURNAL, 1964, 42 (08) : 818 - 819
  • [8] Expansions for the distribution of asymptotically chi-square statistics
    Withers, Christopher S.
    Nadarajah, Saralees
    [J]. STATISTICAL METHODOLOGY, 2013, 12 : 16 - 30
  • [9] Feature selection using an improved Chi-square for Arabic text classification
    Bahassine, Said
    Madani, Abdellah
    Al-Sarem, Mohammed
    Kissi, Mohamed
    [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2020, 32 (02) : 225 - 231
  • [10] MINIMUM CHI-SQUARE STATISTICS IN CONTINGENCY-TABLES
    QUADE, D
    SALAMA, IA
    [J]. BIOMETRICS, 1975, 31 (04) : 953 - 956