Mutual Information Using Sample Variance for Text Feature Selection

Cited by: 7
Authors
Agnihotri, Deepak [1 ]
Verma, Kesari [1 ]
Tripathi, Priyanka [2 ]
Affiliations
[1] Natl Inst Technol, Dept Comp Applicat, Raipur 492010, CG, India
[2] Natl Inst Tech Teachers Training & Res Bhopal, Dept Comp Engn & Applicat, Bhopal, MP, India
Keywords
Feature selection; Text Classification; Term Frequency; Text Analysis; Text Mining; Information Retrieval; SCHEME;
DOI
10.1145/3162957.3163054
CLC number
TP301 [Theory, Methods];
Discipline code
081202
Abstract
Feature selection improves the training speed of a classifier without affecting its predictive capability. It selects a subset of the most informative words (terms) from the set of all words. Term distribution affects the feature selection process: an even distribution of terms within a specific class indicates a strong association of those terms with that class, whereas an even distribution across almost all classes indicates a weaker association. This paper computes the sample variance alongside the standard Mutual Information (MI) method to measure variations in term distribution. The MI method assigns a higher rank to terms concentrated in a specific category (i.e., rare terms), which shows its stronger bias toward rare terms than toward common terms (i.e., terms that occur frequently in almost all classes). To address this issue, a new text feature selection method named Mutual Information Using Sample Variance (MIUSV) is proposed in this paper. It considers the sample variance of the term distribution while computing the Mutual Information score of a term. Multinomial Naive Bayes (MNB) and k-Nearest Neighbor (kNN) classifier models check the utility of the terms selected by the proposed MIUSV. These models classify four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23. Two standard performance measures, Macro-F1 and Micro-F1, show a significant improvement in the results obtained with the proposed MIUSV method.
Pages: 39-44
Number of pages: 6
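The record does not give the exact MIUSV formula, so the Python sketch below is only a minimal illustration of the idea described in the abstract: rank each term by its standard mutual information with a class, weighted by the sample variance of the term's per-class document frequencies. The function names (mi_score, sample_variance, miusv_scores), the (1 + variance) weighting, and the toy corpus are assumptions made for illustration, not the authors' implementation.

# Minimal sketch of variance-weighted MI term scoring. The exact MIUSV
# combination is not given in this record, so MI * (1 + sample variance)
# below is an illustrative assumption.
import math
from collections import defaultdict

def mi_score(n11, n1_, n_1, n):
    """Pointwise mutual information between a term and a class:
    log( P(term, class) / (P(term) * P(class)) )."""
    if n11 == 0:
        return 0.0
    return math.log((n11 * n) / (n1_ * n_1))

def sample_variance(values):
    """Unbiased sample variance of a term's per-class document frequencies."""
    k = len(values)
    if k < 2:
        return 0.0
    mean = sum(values) / k
    return sum((v - mean) ** 2 for v in values) / (k - 1)

def miusv_scores(docs, labels):
    """Score each term: max over classes of MI, weighted by (1 + sample
    variance of its per-class document frequency). docs: list of token lists."""
    classes = sorted(set(labels))
    n = len(docs)
    df = defaultdict(lambda: defaultdict(int))   # term -> class -> document frequency
    class_size = defaultdict(int)                # class -> number of documents
    for tokens, c in zip(docs, labels):
        class_size[c] += 1
        for t in set(tokens):
            df[t][c] += 1
    scores = {}
    for t, per_class in df.items():
        total_df = sum(per_class.values())
        var = sample_variance([per_class.get(c, 0) for c in classes])
        best_mi = max(mi_score(per_class.get(c, 0), total_df, class_size[c], n)
                      for c in classes)
        scores[t] = best_mi * (1.0 + var)        # assumed variance weighting
    return scores

# Example: rank terms on a toy corpus and keep the top-k as selected features.
if __name__ == "__main__":
    docs = [["nasa", "space", "orbit"], ["space", "launch"],
            ["hockey", "game"], ["game", "score", "hockey"]]
    labels = ["sci", "sci", "sport", "sport"]
    ranked = sorted(miusv_scores(docs, labels).items(),
                    key=lambda kv: kv[1], reverse=True)
    print(ranked[:3])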