A new feature selection method for handling redundant information in text classification

被引:9
|
作者
Wang, You-wei [1 ]
Feng, Li-zhou [2 ]
机构
[1] Cent Univ Finance & Econ, Sch Informat, Beijing 100081, Peoples R China
[2] Tianjin Univ Finance & Econ, Sch Sci & Engn, Tianjin 300222, Peoples R China
基金
中国国家自然科学基金; 北京市自然科学基金;
关键词
Feature selection; Dimensionality reduction; Text classification; Redundant features; Support vector machine; Naive Bayes; Mutual information; MUTUAL INFORMATION; HARMONY SEARCH; CATEGORIZATION; ALGORITHM;
D O I
10.1631/FITEE.1601761
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Feature selection is an important approach to dimensionality reduction in the field of text classification. Because of the difficulty in handling the problem that the selected features always contain redundant information, we propose a new simple feature selection method, which can effectively filter the redundant features. First, to calculate the relationship between two words, the definitions of word frequency based relevance and correlative redundancy are introduced. Furthermore, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve the execution speed, the redundant features in FS1 are filtered by combining a predetermined threshold, and the filtered features are memorized in the linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578) where in support vector machines and na < ve Bayes are used. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and the OFS methods. Moreover, the proposed method runs faster than typical mutual information-based methods (improved and normalized mutual information-based feature selections, and multilabel feature selection based on maximum dependency and minimum redundancy) while simultaneously ensuring classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
引用
收藏
页码:221 / 234
页数:14
相关论文
共 50 条
  • [41] A New Method of Feature Selection for Flow Classification
    Sun, Meifeng
    Chen, Jingtao
    Zhang, Yun
    Shi, Shangzhe
    2010 INTERNATIONAL COLLOQUIUM ON COMPUTING, COMMUNICATION, CONTROL, AND MANAGEMENT (CCCM2010), VOL I, 2010, : 299 - 302
  • [42] A New Method of Text Feature Selection for Knowledge Discovery
    Zhang, Li
    Liu, Xing
    An, Rong
    Zhao, Xin
    Yi, Kejia
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS RESEARCH AND MECHATRONICS ENGINEERING, 2015, 121 : 1787 - 1790
  • [43] Dynamic feature selection in text classification
    Doan, Son
    Horiguchi, Susumu
    INTELLIGENT CONTROL AND AUTOMATION, 2006, 344 : 664 - 675
  • [44] Contextual feature selection for text classification
    Paradis, Francois
    Nie, Jian-Yun
    INFORMATION PROCESSING & MANAGEMENT, 2007, 43 (02) : 344 - 352
  • [45] Hybrid feature selection for text classification
    Gunal, Serkan
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2012, 20 : 1296 - 1311
  • [46] Feature selection for text classification: A review
    Deng, Xuelian
    Li, Yuqing
    Weng, Jian
    Zhang, Jilian
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (03) : 3797 - 3816
  • [47] Feature Selection Strategy in Text Classification
    Fung, Pui Cheong Gabriel
    Morstatter, Fred
    Liu, Huan
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT I: 15TH PACIFIC-ASIA CONFERENCE, PAKDD 2011, 2011, 6634 : 26 - 37
  • [48] Feature selection for text classification: A review
    Xuelian Deng
    Yuqing Li
    Jian Weng
    Jilian Zhang
    Multimedia Tools and Applications, 2019, 78 : 3797 - 3816
  • [49] Feature Selection for Ordinal Text Classification
    Baccianella, Stefano
    Esuli, Andrea
    Sebastiani, Fabrizio
    NEURAL COMPUTATION, 2014, 26 (03) : 557 - 591
  • [50] Feature Selection Methods for Text Classification
    Dasgupta, Anirban
    Drineas, Petros
    Harb, Boulos
    Josifovski, Vanja
    Mahoney, Michael W.
    KDD-2007 PROCEEDINGS OF THE THIRTEENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2007, : 230 - +