Mutual Information Using Sample Variance for Text Feature Selection

Cited by: 7
Authors
Agnihotri, Deepak [1 ]
Verma, Kesari [1 ]
Tripathi, Priyanka [2 ]
Affiliations
[1] Natl Inst Technol, Dept Comp Applicat, Raipur 492010, CG, India
[2] Natl Inst Tech Teachers Training & Res Bhopal, Dept Comp Engn & Applicat, Bhopal, MP, India
Keywords
Feature selection; Text classification; Term frequency; Text analysis; Text mining; Information retrieval; Scheme
DOI
10.1145/3162957.3163054
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Subject Classification Code
081202;
Abstract
Feature selection improves the training speed of a classifier without affecting its predictive capability. It selects a subset of the most informative words (terms) from the set of all words. Term distribution affects the feature selection process: an even distribution of terms within a specific class indicates a stronger association of those terms with that class, while an even distribution across almost all classes indicates a weaker association. This paper measures the variation in term distribution by computing the sample variance and combining it with the standard Mutual Information (MI) method. The MI method assigns a higher rank to terms concentrated in a specific category (i.e., rare terms), which shows its stronger bias toward rare terms than toward common terms (i.e., terms that occur most frequently across almost all classes). To address this issue, a new text feature selection method named Mutual Information Using Sample Variance (MIUSV) is proposed in this paper. It considers the sample variance of a term's distribution while computing the term's Mutual Information score. Multinomial Naive Bayes (MNB) and k-Nearest Neighbor (kNN) classifier models check the utility of the terms selected by the proposed MIUSV. These models classify four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23. Two standard performance measures, Macro-F1 and Micro-F1, show a significant improvement in the results obtained with the proposed MIUSV method.
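To make the idea concrete, the Python sketch below illustrates how a term's pointwise Mutual Information score can be weighted by the sample variance of its class-wise relative document frequencies. The abstract does not give the exact MIUSV formula, so the product mi * s2 and all names here (mi_usv_scores, docs, labels) are illustrative assumptions, not the authors' definition.

import math
from collections import Counter

def mi_usv_scores(docs, labels):
    """Score each term by MI weighted with sample variance (assumed combination).

    docs: list of token lists; labels: parallel list of class labels.
    Assumes at least two classes (the sample variance divides by k - 1).
    """
    classes = sorted(set(labels))
    k = len(classes)
    n_docs = len(docs)
    df = Counter()                                  # document frequency of each term
    df_per_class = {c: Counter() for c in classes}  # per-class document frequency
    n_per_class = Counter(labels)                   # documents per class
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            df_per_class[c][t] += 1

    scores = {}
    for t, n_t in df.items():
        # Pointwise MI of term t with class c: MI(t, c) = log(P(t, c) / (P(t) * P(c))).
        # Taking the maximum over classes is one standard aggregation choice.
        mi = max(
            math.log((df_per_class[c][t] / n_docs + 1e-12)
                     / ((n_t / n_docs) * (n_per_class[c] / n_docs)))
            for c in classes
        )
        # Sample variance of the term's relative document frequency across classes:
        # s^2 = sum_i (f_i - mean)^2 / (k - 1).
        freqs = [df_per_class[c][t] / n_per_class[c] for c in classes]
        mean = sum(freqs) / k
        s2 = sum((f - mean) ** 2 for f in freqs) / (k - 1)
        scores[t] = mi * s2  # assumed combination, not the paper's formula
    return scores

A caller would rank the returned scores in descending order and keep the top-m terms as the selected feature set before training the MNB or kNN classifier.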
Pages: 39-44
Page count: 6
Related Papers
50 records in total
  • [1] Feature Selection for Text Classification Using Mutual Information
    Sel, Ilhami
    Karci, Ali
    Hanbay, Davut
    [J]. 2019 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND DATA PROCESSING (IDAP 2019), 2019,
  • [2] Feature selection using improved mutual information for text classification
    Novovicová, J
    Malík, A
    Pudil, P
    [J]. STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, PROCEEDINGS, 2004, 3138 : 1010 - 1017
  • [3] Discriminant Mutual Information for Text Feature Selection
    Wang, Jiaqi
    Zhang, Li
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2021), PT II, 2021, 12682 : 136 - 151
  • [4] Improved Mutual Information Method For Text Feature Selection
    Ding Xiaoming
    Tang Yan
    [J]. PROCEEDINGS OF THE 2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION (ICCSE 2013), 2013, : 163 - 166
  • [5] Feature selection algorithm for text classification based on improved mutual information
    丛帅
    张积宾
    徐志明
    王宇颖
    [J]. Journal of Harbin Institute of Technology (New Series), 2011, (03) : 144 - 148
  • [6] Weighted average pointwise mutual information for feature selection in text categorization
    Schneider, KM
    [J]. KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 252 - 263
  • [7] Feature selection using a mutual information based measure
    Al-Ani, A
    Deriche, M
    [J]. 16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL IV, PROCEEDINGS, 2002, : 82 - 85
  • [8] Feature selection using mutual information in CT colonography
    Ong, Ju Lynn
    Seghouane, Abd-Krim
    [J]. PATTERN RECOGNITION LETTERS, 2011, 32 (02) : 337 - 341
  • [9] Using Mutual Information for Feature Selection in Programmatic Advertising
    Ciesielczyk, Michal
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON INNOVATIONS IN INTELLIGENT SYSTEMS AND APPLICATIONS (INISTA), 2017, : 290 - 295
  • [10] Feature selection using Joint Mutual Information Maximisation
    Bennasar, Mohamed
    Hicks, Yulia
    Setchi, Rossitza
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (22) : 8520 - 8532