A new feature selection method for handling redundant information in text classification

Cited by: 9
Authors
Wang, You-wei [1]
Feng, Li-zhou [2]
Affiliations
[1] Cent Univ Finance & Econ, Sch Informat, Beijing 100081, Peoples R China
[2] Tianjin Univ Finance & Econ, Sch Sci & Engn, Tianjin 300222, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
Feature selection; Dimensionality reduction; Text classification; Redundant features; Support vector machine; Naive Bayes; Mutual information; Harmony search; Categorization; Algorithm;
DOI
10.1631/FITEE.1601761
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Feature selection is an important approach to dimensionality reduction in text classification. Because the selected features always contain some redundant information, which is difficult to handle, we propose a simple new feature selection method that effectively filters redundant features. First, to quantify the relationship between two words, the definitions of word-frequency-based relevance and correlative redundancy are introduced. Next, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve execution speed, the redundant features in FS1 are filtered using a predetermined threshold, and the filtered features are stored in linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578), in which support vector machines and naïve Bayes are used as classifiers. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and of the OFS methods. Moreover, the proposed method runs faster than typical mutual information-based methods (improved and normalized mutual information-based feature selections, and multilabel feature selection based on maximum dependency and minimum redundancy) while maintaining classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
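The threshold-based filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: redundancy between two words is approximated by the Jaccard overlap of the document sets in which they occur (not the paper's correlative redundancy), the ranked list stands in for the subset FS1, and plain Python lists stand in for the linked lists. The names doc_sets, jaccard, and filter_redundant are hypothetical.

```python
from collections import defaultdict


def doc_sets(docs):
    """Map each word to the set of document indices that contain it."""
    occ = defaultdict(set)
    for i, tokens in enumerate(docs):
        for w in set(tokens):
            occ[w].add(i)
    return occ


def jaccard(a, b):
    """Overlap of two document sets; an assumed stand-in redundancy measure."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_redundant(ranked, occ, threshold):
    """Greedy filter over a ranked feature list (stand-in for FS1).

    A word is dropped when its overlap with an already kept word
    reaches the threshold; dropped words are recorded under the
    kept word that made them redundant.
    """
    kept = []
    dropped = defaultdict(list)
    for w in ranked:
        for k in kept:
            if jaccard(occ[w], occ[k]) >= threshold:
                dropped[k].append(w)
                break
        else:
            # No kept word was similar enough, so w is retained.
            kept.append(w)
    return kept, dict(dropped)


# Toy corpus: two finance documents and two sports documents.
docs = [
    ["stock", "market", "shares"],
    ["stock", "market", "trade"],
    ["match", "goal", "team"],
    ["match", "team", "coach"],
]
occ = doc_sets(docs)
ranked = ["stock", "market", "match", "team", "goal"]  # assumed ranking
kept, dropped = filter_redundant(ranked, occ, threshold=0.8)
print(kept)     # ['stock', 'match', 'goal']
print(dropped)  # {'stock': ['market'], 'match': ['team']}
```

In this toy run, "market" and "team" occur in exactly the same documents as "stock" and "match", so they are filtered, while "goal" overlaps with "match" only partially and is kept. A lower threshold would filter more aggressively.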
Pages: 221 - 234
Number of pages: 14
Related papers
50 in total
  • [21] Information-theoretic feature selection algorithms for text classification
    Novovicová, J
    Malík, A
    PROCEEDINGS OF THE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), VOLS 1-5, 2005, : 3272 - 3277
  • [22] Two new feature selection metrics for text classification
    Sahin, Durmus Ozkan
    Kilic, Erdal
    AUTOMATIKA, 2019, 60 (02) : 162 - 171
  • [23] Text Feature Selection Method in battlefield information service
    Wang Kai
    Liu Jingzhi
    Wang Kai
    Gan Zhichun
    Cai Yanjun
    2016 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT), 2016, : 216 - 220
  • [24] Improved Mutual Information Method For Text Feature Selection
    Ding Xiaoming
    Tang Yan
    PROCEEDINGS OF THE 2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION (ICCSE 2013), 2013, : 163 - 166
  • [25] A New Feature Selection Method for Text Clustering
    XU Junling1
    2. State Key Laboratory of Software Engineering
    3. Department of Computer Science and Engineering
    Wuhan University Journal of Natural Sciences, 2007, (05) : 912 - 916
  • [26] Feature Selection in Text Classification
    Sahin, Durmus Ozkan
    Ates, Nurullah
    Kilic, Erdal
    2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 1777 - 1780
  • [27] Research on Feature Selection and kNN Classification Method in Chinese Text Classification
    Xiao Chao
    Wu Ping
    PROCEEDINGS OF THE 2015 4TH NATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS AND COMPUTER ENGINEERING (NCEECE 2015), 2016, 47 : 956 - 962
  • [28] A new feature extraction method for text classification
    Yildiz, H. Kemal
    Genctav, Murat
    Usta, Nurullah
    Diri, Banu
    Amasyali, M. Fatih
    2007 IEEE 15TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS, VOLS 1-3, 2007, : 326 - 329
  • [29] A Novel Feature Selection Method Based on Category Information Analysis for Class Prejudging in Text Classification
    Wang, Qiang
    Guan, Yi
    Wang, XiaoLong
    Xu, Zhiming
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2006, 6 (1A) : 113 - 119
  • [30] A NEW FEATURE SELECTION METHOD BASED ON CONCEPT EXTRACTION IN AUTOMATIC CHINESE TEXT CLASSIFICATION
    Liao, Shasha
    Jiang, Minghu
    NEW MATHEMATICS AND NATURAL COMPUTATION, 2007, 3 (03) : 331 - 347