Improved Document Feature Selection with Categorical Parameter for Text Classification

Cited by: 1
Authors
Wang, Fen [1]
Li, Xiaoxuan [1]
Huang, Xiaotao [1]
Kang, Ling [2]
Affiliations
[1] Huazhong Univ Sci & Technol, Dept Comp Sci & Technol, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Dept Hydropower & Informat Engn, Wuhan, Hubei, Peoples R China
Source
MOBILE, SECURE, AND PROGRAMMABLE NETWORKING (MSPN 2016) | 2016 / Vol. 10026
Keywords
Feature selection; Measurement; Comparison; Time efficiency; Experimentation; FEATURE-EXTRACTION;
DOI
10.1007/978-3-319-50463-6_8
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Social networks are developing rapidly, and thousands of new data items appear on the Internet every day. Classification technology is key to organizing big data, and feature selection (FS) is a direct way to improve classification efficiency. FS reduces the size of the feature subset while preserving classification accuracy, based on a score that the FS method computes for each feature. Most previous FS studies emphasized precision, while time efficiency was commonly ignored. In this study, we first propose a method named CDFDC, which combines CDF and Category-Frequency. Second, we compare DF, CDF, CHI, IG, CDFP VM and CDFDC to determine the relationships among algorithm complexity, time efficiency and classification accuracy. The experiments use the 20-news-group data set and an NB classifier. The FS methods are evaluated on seven measures: precision, Micro F1, Macro F1, feature-selection time, document-conversion time, training time and classification time. The results show that the proposed method performs well in both efficiency and accuracy when the feature subset contains more than 3,000 features. We also find that an FS algorithm's complexity is unrelated to accuracy, but that complexity can ensure time stability and predictability.
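
As an illustration of the score-then-select pattern the abstract describes, the following is a minimal sketch of plain document-frequency (DF) feature selection, the simplest of the baselines named above. It is not the paper's CDFDC method, whose formula is not given in this record. The function names (`document_frequency`, `select_top_k`, `project`) and the toy corpus are assumptions made for the example.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): a document counts each term once
    return df


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def project(tokens, vocabulary):
    """Keep only the tokens that belong to the selected feature subset."""
    keep = set(vocabulary)
    return [t for t in tokens if t in keep]


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["space", "nasa", "orbit", "launch"],
    ["nasa", "launch", "rocket"],
    ["hockey", "team", "game"],
    ["game", "team", "nasa"],
]

df = document_frequency(docs)
features = select_top_k(df, 3)
reduced = [project(tokens, features) for tokens in docs]
```

Any other scoring function (for example CHI or IG) could be passed to `select_top_k` in place of the DF counts, since the selection step only needs a term-to-score mapping.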
Pages: 86-98
Page count: 13
Related papers
50 in total
  • [21] A Review on Feature Selection and Feature Extraction for Text Classification
    Shah, Foram P.
    Patel, Vibha
    PROCEEDINGS OF THE 2016 IEEE INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, SIGNAL PROCESSING AND NETWORKING (WISPNET), 2016, : 2264 - 2268
  • [22] Feature selection for the classification of large document collections
    Brank, Janez
    Mladenic, Dunja
    Grobelnik, Marko
    Milic-Frayling, Natasa
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2008, 14 (10) : 1562 - 1596
  • [23] The impact of feature selection on medical document classification
    Parlak, Bekir
    Uysal, Alper Kursat
    2016 11TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI), 2016,
  • [24] Feature selection for document classification based on topology
    El Barbary, O. G.
    Salama, A. S.
    EGYPTIAN INFORMATICS JOURNAL, 2018, 19 (02) : 129 - 132
  • [25] Discriminative Feature Analysis and Selection for Document Classification
    Chinta, Punya Murthy
    Murty, M. Narasimha
    NEURAL INFORMATION PROCESSING, ICONIP 2012, PT I, 2012, 7663 : 366 - 374
  • [26] The Influence of Feature Representation of Text on the Performance of Document Classification
    Martincic-Ipsic, Sanda
    Milicic, Tanja
    Todorovski, Ljupco
    APPLIED SCIENCES-BASEL, 2019, 9 (04):
  • [27] OPTIMAL FEATURE SUBSET SELECTION BASED ON COMBINING DOCUMENT FREQUENCY AND TERM FREQUENCY FOR TEXT CLASSIFICATION
    Karpagalingam, Thirumoorthy
    Karuppaiah, Muneeswaran
    COMPUTING AND INFORMATICS, 2020, 39 (05) : 881 - 906
  • [29] Categorical Term Frequency Probability Based Feature Selection for Document Categorization
    Li, Qiang
    He, Liang
    Lin, Xin
    2013 INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR), 2013, : 60 - 65
  • [30] Comparison on Feature Selection Methods for Text Classification
    Liu, Wenkai
    Xiao, Jiongen
    Hong, Ming
    2020 THE 4TH INTERNATIONAL CONFERENCE ON MANAGEMENT ENGINEERING, SOFTWARE ENGINEERING AND SERVICE SCIENCES (ICMSS 2020), 2020, : 82 - 86