Improved Document Feature Selection with Categorical Parameter for Text Classification

Cited by: 1
Authors
Wang, Fen [1]
Li, Xiaoxuan [1]
Huang, Xiaotao [1]
Kang, Ling [2]
Affiliations
[1] Huazhong Univ Sci & Technol, Dept Comp Sci & Technol, Wuhan, Hubei, Peoples R China
[2] Huazhong Univ Sci & Technol, Dept Hydropower & Informat Engn, Wuhan, Hubei, Peoples R China
Keywords
Feature selection; Measurement; Comparison; Time efficiency; Experimentation; FEATURE-EXTRACTION;
DOI
10.1007/978-3-319-50463-6_8
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Social networks are developing rapidly, and thousands of new data items appear on the Internet every day. Classification technology is key to organizing big data, and Feature Selection (FS) is a direct way to improve classification efficiency. FS reduces the size of the feature subset while preserving classification accuracy, based on feature scores calculated by the FS method. Most previous studies of FS emphasized precision, while time efficiency was commonly ignored. In this study, we first propose a method named CDFDC, which combines CDF and Category-Frequency. Second, we compare DF, CDF, CHI, IG, CDFP VM and CDFDC to determine the relationships among algorithm complexity, time efficiency and classification accuracy. The experiments use the 20-news-group data set and an NB classifier. The performance of the FS methods is evaluated on seven aspects: precision, Micro F1, Macro F1, feature-selection time, document-conversion time, training time and classification time. The results show that the proposed method performs well in both efficiency and accuracy when the size of the feature subset is greater than 3,000. We also find that an FS algorithm's complexity is unrelated to accuracy, but that complexity can ensure time stability and predictability.
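As a rough illustration of score-based feature selection in general, the sketch below ranks terms by plain document frequency (DF), the simplest of the methods the abstract compares, and keeps the top k. It is not the CDFDC method, whose scoring formula is not given in this record. The function names, the toy corpus and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically,
    an arbitrary choice made only to keep the example deterministic)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def project(tokens, vocabulary):
    """Keep only the tokens of a document that belong to the selected features."""
    keep = set(vocabulary)
    return [t for t in tokens if t in keep]


# Hypothetical toy corpus; each document is a list of tokens.
docs = [
    ["space", "nasa", "orbit"],
    ["space", "shuttle", "nasa"],
    ["hockey", "team", "game"],
    ["game", "team", "space"],
]

df = document_frequency(docs)
features = select_top_k(df, 3)
reduced = [project(tokens, features) for tokens in docs]
```

Any other scoring function (for example a chi-square or information-gain score) could be passed to `select_top_k` in place of `df`, since the selection step only needs a mapping from term to score.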
Pages: 86-98
Number of pages: 13
Related papers
50 in total
  • [1] Weighted Document Frequency for Feature Selection in Text Classification
    Li, Baoli
    Yan, Qiuling
    Xu, Zhenqiang
    Wang, Guicai
    [J]. PROCEEDINGS OF 2015 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2015, : 132 - 135
  • [2] An improved global feature selection scheme for text classification
    Uysal, Alper Kursat
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2016, 43 : 82 - 92
  • [3] Traditional and Swarm Intelligent Based Text Feature Selection for Document Classification
    Kyaw, Khin Sandar
    Limsiroratana, Somchai
    [J]. ISCIT 2019: PROCEEDINGS OF 2019 19TH INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS AND INFORMATION TECHNOLOGIES (ISCIT), 2019, : 226 - 231
  • [4] Feature selection using improved mutual information for text classification
    Novovicová, J
    Malík, A
    Pudil, P
    [J]. STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, PROCEEDINGS, 2004, 3138 : 1010 - 1017
  • [5] Feature Selection in Text Classification
    Sahin, Durmus Ozkan
    Ates, Nurullah
    Kilic, Erdal
    [J]. 2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 1777 - 1780
  • [6] An improved document classification approach with maximum entropy and entropy feature selection
    Pang, Xiu-Li
    Feng, Yu-Qiang
    Jiang, Wei
    [J]. PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2007, : 3911 - +
  • [7] Feature selection algorithm for text classification based on improved mutual information
    丛帅
    张积宾
    徐志明
    王宇颖
    [J]. Journal of Harbin Institute of Technology(New series), 2011, (03) : 144 - 148
  • [8] An improved method of feature selection based on concept attributes in text classification
    Liao, SS
    Jiang, MH
    [J]. ADVANCES IN NATURAL COMPUTATION, PT 1, PROCEEDINGS, 2005, 3610 : 1140 - 1149
  • [9] Feature selection for document type classification
    Taghva, Kazem
    Vergara, Jason
    [J]. PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: NEW GENERATIONS, 2008, : 179 - 182
  • [10] Investigating Optimal Feature Selection Method to Improve the Performance of Amharic Text Document Classification
    Alemu, Tamir Anteneh
    Tegegnie, Alemu Kumilachew
    [J]. AFRICAN JOURNAL OF LIBRARY ARCHIVES AND INFORMATION SCIENCE, 2019, 29 (02): : 103 - 113