A feature selection algorithm for document clustering based on word co-occurrence frequency

Cited by: 0
Authors
Liu, YC [1]
Wang, XL [1]
Liu, BQ [1]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin 150001, Peoples R China
Keywords
document clustering; feature selection; cluster hypothesis; word co-occurrence frequency
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Constructing the feature space from only the more informative words can greatly speed up document clustering algorithms without affecting cluster quality. This paper first discusses the impact of feature selection on document clustering and then proposes a new feature selection method based on word co-occurrence frequency. According to the cluster hypothesis, documents from the same class are more similar to each other when they are represented in the vector space model (VSM), so many of the words in these documents tend to occur together. We identify these words by their co-occurrence and then construct a reduced feature space for clustering. Experiments show that the selected features are more salient. When documents are clustered in the reduced feature space, run time is shortened greatly while cluster quality is almost unchanged, which makes the clustering algorithm more suitable for practical use.
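The abstract does not state the exact scoring rule, so the following is only an illustrative sketch of the general idea: count how often pairs of words appear in the same document, and keep the words that take part in frequently co-occurring pairs. The function name `select_features_by_cooccurrence`, the `min_pair_count` threshold, and the toy documents are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations


def select_features_by_cooccurrence(docs, min_pair_count=2, top_k=None):
    """Rank words by how often they co-occur with other words.

    Hypothetical scoring rule (not from the paper): a word's score is the
    summed document-level count of every word pair it belongs to, counting
    only pairs seen in at least `min_pair_count` documents.
    """
    pair_counts = Counter()
    for doc in docs:
        # Each distinct word counts once per document.
        vocab = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(vocab, 2))

    scores = Counter()
    for (w1, w2), n in pair_counts.items():
        if n >= min_pair_count:
            scores[w1] += n
            scores[w2] += n

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k] if top_k is not None else ranked


# Toy corpus (invented for illustration).
docs = [
    "stock market prices fall",
    "stock market rally continues",
    "football match ends in draw",
    "football match postponed",
]
features = select_features_by_cooccurrence(docs, min_pair_count=2)
print(features)
```

In this toy corpus only the pairs (market, stock) and (football, match) appear in two documents, so those four words form the reduced feature space; documents would then be vectorized over these words alone before clustering.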
Pages: 2963-2968
Page count: 6
Related papers
50 in total
  • [1] Genetic Algorithm for Feature Selection of MR Brain Images Using Wavelet Co-occurence
    Kharrat, Ahmed
    Benamrane, Nacera
    Ben Messaoud, Mohamed
    Abid, Mohamed
    [J]. INTERNATIONAL CONFERENCE ON GRAPHIC AND IMAGE PROCESSING (ICGIP 2011), 2011, 8285
  • [2] Sampling and feature selection in a genetic algorithm for document clustering
    Casillas, A
    de Lena, MTG
    Martínez, R
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2004, 2945 : 601 - 612
  • [3] LDA Based Feature Selection for Document Clustering
    Kumar, B. Shravan
    Ravi, Vadlamani
    [J]. COMPUTE'17: PROCEEDINGS OF THE 10TH ANNUAL ACM INDIA COMPUTE CONFERENCE, 2017, : 125 - 130
  • [4] Feature selection and document clustering
    Dhillon, I
    Kogan, J
    Nicholas, C
    [J]. SURVEY OF TEXT MINING: CLUSTERING, CLASSIFICATION, AND RETRIEVAL, 2004, : 73 - 100
  • [5] Text sentiment classification based on a genetic algorithm and word and document co-clustering
    Kotelnikov, E. V.
    Pletneva, M. V.
    [J]. JOURNAL OF COMPUTER AND SYSTEMS SCIENCES INTERNATIONAL, 2016, 55 (01) : 106 - 114
  • [6] Text sentiment classification based on a genetic algorithm and word and document co-clustering
    E. V. Kotelnikov
    M. V. Pletneva
    [J]. Journal of Computer and Systems Sciences International, 2016, 55 : 106 - 114
  • [7] Human interaction recognition based on the co-occurence of visual words
    Slimani, K. Nour el Houda
    Benezeth, Yannick
    Souami, Feriel
    [J]. 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), 2014, : 461 - +
  • [8] ON IMAGE-ENHANCEMENT AND THRESHOLD SELECTION USING THE GRAYLEVEL CO-OCCURENCE MATRIX
    CHANDA, B
    CHAUDHURI, BB
    MAJUMDER, DD
    [J]. PATTERN RECOGNITION LETTERS, 1985, 3 (04) : 243 - 251
  • [9] A Clustering Based Genetic Algorithm for Feature Selection
    Rostami, Mehrdad
    Moradi, Parham
    [J]. 2014 6TH CONFERENCE ON INFORMATION AND KNOWLEDGE TECHNOLOGY (IKT), 2014, : 112 - 116
  • [10] A fuzzy clustering based algorithm for feature selection
    Sun, HJ
    Wang, SR
    Mei, Z
    [J]. 2002 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-4, PROCEEDINGS, 2002, : 1993 - 1998