Composite Feature Extraction and Selection for Text Classification

Cited by: 15
Authors
Wan, Chuan [1]
Wang, Yuling [1]
Liu, Yaoze [1]
Ji, Jinchao [1]
Feng, Guozhong [1, 2]
Affiliations
[1] Northeast Normal Univ, Sch Informat Sci & Technol, Changchun 130117, Jilin, Peoples R China
[2] Northeast Normal Univ, Key Lab Appl Stat, MOE, Changchun 130024, Jilin, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Funding
National Natural Science Foundation of China
Keywords
Composite feature extraction; composite feature selection; redundancy; text classification; RELEVANCE; SCHEME; MODEL;
DOI
10.1109/ACCESS.2019.2904602
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Although words are the basic semantic units of text, phrases and expressions carry additional information that is important for text classification. To capture this information, traditional algorithms extract composite features from word sequences or co-occurrences, such as bigrams and termsets, but ignore the influence of stop words and punctuation, which produces a large number of weak features. In this paper, we propose a text structure-based algorithm to extract composite features: termsets that cross punctuation marks or stop words in the text are excluded. To eliminate redundancy, we suggest a novel discriminative measure with two factors. One measures relevance, while the other increases the values of composite features whose class frequencies are much smaller than those of their sub-features. Experiments on three benchmark datasets with both a support vector machine and a naive Bayes classifier illustrate the effectiveness of the approach.
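The exclusion rule in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses adjacent word pairs as a simple stand-in for composite features, and the function name `extract_composite_features` and the tiny `STOP_WORDS` set are assumptions made for illustration only. The paper's discriminative measure is not sketched here because the abstract does not give its formula.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
              "is", "are", "to", "for", "with"}


def extract_composite_features(text):
    """Return adjacent word pairs that do not cross punctuation or stop words.

    Hypothetical helper: pairs are formed only inside segments bounded by
    punctuation marks or stop words, mirroring the exclusion rule in the abstract.
    """
    features = []
    # Punctuation marks act as hard boundaries between chunks.
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        segment = []
        # The trailing None is a sentinel that flushes the last segment.
        for token in chunk.split() + [None]:
            if token is None or token in STOP_WORDS:
                # Stop words also act as boundaries: emit pairs, then reset.
                features.extend(zip(segment, segment[1:]))
                segment = []
            else:
                segment.append(token)
    return features


example = "Naive Bayes is fast, and support vector machines are accurate."
print(extract_composite_features(example))
# [('naive', 'bayes'), ('support', 'vector'), ('vector', 'machines')]
```

In this example the pairs ("bayes", "fast") and ("fast", "support") are never produced, because the first would cross the stop word "is" and the second would cross a comma and the stop word "and".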
Pages: 35208-35219
Page count: 12