Text Categorization Based on Clustering Feature Selection

Cited by: 19
Authors
Zhou, Xiaofei [1]
Hu, Yue [1]
Guo, Li [1]
Affiliation
[1] Chinese Acad Sci, Inst Informat Engn, Beijing 100095, Peoples R China
Keywords
Feature selection; text categorization; k-means
DOI
10.1016/j.procs.2014.05.283
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this paper, we discuss a text categorization method based on k-means clustering feature selection. K-means is a classical algorithm for data clustering in text mining, but it is seldom used for feature selection. For text data, words that correctly express the semantics of a class are usually good features. We use k-means to obtain several cluster centroids for each class, and then choose the high-frequency words in the centroids as the text features for categorization. The words extracted by k-means not only represent the clusters of each class well, but are also of high quality for semantic expression. On three standard text databases, classifiers based on our feature selection method perform better than the original classifiers for text categorization. (C) 2014 Published by Elsevier B.V. Open access under CC BY-NC-ND license.
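The procedure outlined in the abstract (cluster each class with k-means, then keep the highest-weighted words of each centroid) might be sketched as follows. This is only an illustrative reading of the abstract, not the authors' implementation: the raw term-frequency vectors, squared Euclidean distance, the seeding strategy, and the names `tf_vectors`, `kmeans`, `select_features`, `k`, and `top_n` are all assumptions introduced here.

```python
import random


def tf_vectors(docs, vocab):
    """Turn whitespace-tokenized documents into raw term-frequency vectors."""
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for word in doc.split():
            vec[index[word]] += 1.0
        vectors.append(vec)
    return vectors


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns the k centroids."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct documents chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_features(docs_by_class, k=2, top_n=3):
    """Collect the top_n highest-weighted words from every per-class centroid."""
    vocab = sorted({w for docs in docs_by_class.values() for d in docs for w in d.split()})
    selected = set()
    for docs in docs_by_class.values():
        vectors = tf_vectors(docs, vocab)
        for centroid in kmeans(vectors, min(k, len(vectors))):
            ranked = sorted(range(len(vocab)), key=lambda i: -centroid[i])
            selected.update(vocab[i] for i in ranked[:top_n] if centroid[i] > 0)
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical toy corpus, used only to exercise the sketch.
    corpus = {
        "sports": ["ball ball goal", "ball team"],
        "tech": ["code code bug", "code test"],
    }
    print(select_features(corpus, k=1, top_n=1))
```

The selected words would then serve as the feature vocabulary for whichever classifier is trained afterwards; the abstract does not say which classifiers or parameter values were used, so none are assumed here.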
Pages: 398-405
Number of pages: 8