Feature selection via maximizing global information gain for text classification

Cited by: 96
Authors
Shang, Changxing [1, 2, 3]
Li, Min [1, 2]
Feng, Shengzhong [1]
Jiang, Qingshan [1]
Fan, Jianping [1]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
[2] Chinese Acad Sci, Grad Sch, Beijing 100080, Peoples R China
[3] Zhengzhou Inst Informat Sci & Technol, Zhengzhou 450001, Peoples R China
Funding
US National Science Foundation;
Keywords
Feature selection; Text classification; High dimensionality; Distributional clustering; Information bottleneck;
DOI
10.1016/j.knosys.2013.09.019
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Feature selection is a vital preprocessing step for text classification, used to mitigate the curse of dimensionality. Most existing metrics (such as information gain) evaluate features individually and completely ignore the redundancy between them. This can decrease the overall discriminative power, because one feature's predictive power is weakened by others. On the other hand, although all higher-order algorithms (such as mRMR) take redundancy into account, their high computational complexity makes them unsuitable for the text domain. This paper proposes a novel metric called global information gain (GIG), which avoids redundancy naturally. An efficient feature selection method called maximizing global information gain (MGIG) is also given. We compare MGIG with four other algorithms on six datasets; the experimental results show that MGIG achieves better results than the other methods in most cases. Moreover, MGIG runs significantly faster than the traditional higher-order algorithms, which makes it a suitable choice for feature selection in the text domain. (C) 2013 Elsevier B.V. All rights reserved.
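The abstract's point that per-feature metrics ignore redundancy can be illustrated with a minimal sketch. This is not the paper's GIG or MGIG method, which the abstract does not specify; it is only the classical per-term information gain, IG(C; t) = H(C) - H(C | t), applied to a made-up toy corpus. The term names, documents, and labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature, labels):
    """Classical IG(C; t) = H(C) - H(C | t) for one discrete feature column."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [c for f, c in zip(feature, labels) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy corpus: rows are documents, columns are term-presence flags.
# "goal" and "match" are deliberately identical, i.e. fully redundant.
terms = ["goal", "match", "stock"]
docs = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
labels = ["sport", "sport", "sport", "finance", "finance", "finance"]

# Score each term on its own, with no knowledge of the other terms.
scores = {
    term: information_gain([row[i] for row in docs], labels)
    for i, term in enumerate(terms)
}

# Keep the two highest-scoring terms (ties keep their original order).
top2 = sorted(terms, key=lambda t: -scores[t])[:2]

for term in terms:
    print(f"{term}: {scores[term]:.4f}")
print("selected:", top2)
```

In this toy case the two identical terms receive the same score and are both selected, even though the second adds no information once the first is known. That is the kind of redundancy the abstract says individual metrics overlook.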
Pages: 298-309
Number of pages: 12
Related papers
50 in total
  • [31] Feature Selection for Ordinal Text Classification
    Baccianella, Stefano
    Esuli, Andrea
    Sebastiani, Fabrizio
    NEURAL COMPUTATION, 2014, 26 (03) : 557 - 591
  • [32] Feature Selection Methods for Text Classification
    Dasgupta, Anirban
    Drineas, Petros
    Harb, Boulos
    Josifovski, Vanja
    Mahoney, Michael W.
    KDD-2007 PROCEEDINGS OF THE THIRTEENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2007, : 230 - +
  • [33] Feature Selection via Maximizing Fuzzy Dependency
    Hu, Qinghua
    Zhu, Pengfei
    Liu, Jinfu
    Yang, Yongbin
    Yu, Daren
    FUNDAMENTA INFORMATICAE, 2010, 98 (2-3) : 167 - 181
  • [34] A Review on Feature Selection and Feature Extraction for Text Classification
    Shah, Foram P.
    Patel, Vibha
    PROCEEDINGS OF THE 2016 IEEE INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, SIGNAL PROCESSING AND NETWORKING (WISPNET), 2016, : 2264 - 2268
  • [35] Comparing PCA to Information Gain as a Feature Selection Method for Influenza-A Classification
    Shaltout, Nemin
    Moustafa, Mohamed
    Rafea, Ahmed
    Moustafa, Ahmed
    ElHefnawi, Mohamed
    2015 INTERNATIONAL CONFERENCE ON INTELLIGENT INFORMATICS AND BIOMEDICAL SCIENCES (ICIIBMS), 2015, : 279 - 283
  • [36] Iterative Feature Selection using Information Gain & Naive Bayes for Document Classification
    Rahman, Chowdhury Mofizur
    Afroze, Lameya
    Refath, Naznin Sultana
    Shawon, Nafin
    2018 21ST INTERNATIONAL CONFERENCE OF COMPUTER AND INFORMATION TECHNOLOGY (ICCIT), 2018,
  • [37] IMPROVED TEXT FEATURE SELECTION ALGORITHMS IN CLASSIFICATION SEARCH OF ENVIRONMENTAL PROTECTION INFORMATION
    Yang, Rongjie
    Man, Shuai
    JOURNAL OF ENVIRONMENTAL PROTECTION AND ECOLOGY, 2019, 20 (03): : 1462 - 1469
  • [38] Modified Pointwise Mutual Information-Based Feature Selection for Text Classification
    Georgieva-Trifonova, Tsvetanka
    PROCEEDINGS OF THE FUTURE TECHNOLOGIES CONFERENCE (FTC) 2021, VOL 2, 2022, 359 : 333 - 353
  • [39] Few-Shot Text Classification with Global-Local Feature Information
    Wang, Depei
    Wang, Zhuowei
    Cheng, Lianglun
    Zhang, Weiwen
    SENSORS, 2022, 22 (12)
  • [40] HTCInfoMax: A Global Model for Hierarchical Text Classification via Information Maximization
    Deng, Zhongfen
    Peng, Hao
    He, Dongxiao
    Li, Jianxin
    Yu, Philip S.
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 3259 - 3265