Composite Feature Extraction and Selection for Text Classification

Cited by: 15
Authors
Wan, Chuan [1]
Wang, Yuling [1]
Liu, Yaoze [1]
Ji, Jinchao [1]
Feng, Guozhong [1, 2]
Affiliations
[1] Northeast Normal Univ, Sch Informat Sci & Technol, Changchun 130117, Jilin, Peoples R China
[2] Northeast Normal Univ, Key Lab Appl Stat, MOE, Changchun 130024, Jilin, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Funding
National Natural Science Foundation of China
Keywords
Composite feature extraction; composite feature selection; redundancy; text classification; RELEVANCE; SCHEME; MODEL;
DOI
10.1109/ACCESS.2019.2904602
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Although words are the basic semantic units in text, phrases and expressions contain additional information that is important for text classification. To capture this information, traditional algorithms extract composite features from word sequences or co-occurrences, such as bigrams and termsets, but ignore the influence of stop words and punctuation, which results in a huge number of weak features. In this paper, we propose a text structure-based algorithm to extract composite features. Termsets that cross punctuation marks or stop words in the text are excluded. To eliminate redundancy, a novel discriminative measure containing two factors is suggested. One measures relevancy, while the other increases the values of composite features whose class frequencies are much smaller than those of their sub-features. Experiments on three benchmark datasets with both a support vector machine and a naive Bayes classifier illustrate the effectiveness of the approach.
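The exclusion rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only adjacent-word pairs (bigrams) rather than general termsets, and the stop-word list, tokenizer pattern, and function names are hypothetical choices made for this illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_into_segments(text, stop_words=STOP_WORDS):
    """Split text into runs of content words, breaking at stop words and punctuation."""
    segments, current = [], []
    for token in TOKEN_RE.findall(text.lower()):
        if token.isalnum() and token not in stop_words:
            current.append(token)
        else:
            # A stop word or non-alphanumeric token closes the current segment.
            if current:
                segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def extract_bigrams(text, stop_words=STOP_WORDS):
    """Return adjacent word pairs that do not cross a stop word or punctuation mark."""
    bigrams = []
    for segment in split_into_segments(text, stop_words):
        bigrams.extend(zip(segment, segment[1:]))
    return bigrams


if __name__ == "__main__":
    print(extract_bigrams("Text classification is hard, feature selection helps."))
```

In this sketch the pair ("hard", "feature") is never produced because a comma separates the two words, and ("classification", "hard") is dropped because the stop word "is" lies between them.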
Pages: 35208-35219
Number of pages: 12