A feature selection algorithm for document clustering based on word co-occurrence frequency

Cited by: 0

Authors:

Liu, YC [1]

Wang, XL [1]

Liu, BQ [1]

Affiliation:

[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin 150001, Peoples R China

Source:

PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7 | 2004

Keywords:

document clustering; feature selection; cluster hypothesis; word co-occurrence frequency

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Constructing the feature space from only the more informative words can greatly speed up document clustering without affecting cluster quality. This paper first discusses the impact of feature selection on document clustering, then proposes a new feature selection method based on word co-occurrence frequency. According to the cluster hypothesis, documents from the same class are more similar to each other when they are represented in the vector space model (VSM), so many of the words in these documents tend to appear together. We identify these words by their co-occurrence and use them to construct a reduced feature space for clustering. Experiments show that the selected features are more salient. When documents are clustered in the reduced feature space, run time is shortened greatly while cluster quality remains almost unchanged, which makes the clustering algorithm more suitable for practical use.
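The abstract does not give the selection rule itself, so the following Python fragment is only a minimal sketch of the general idea, not the authors' algorithm. It assumes whitespace tokenization, counts each unordered word pair once per document, and keeps every word that belongs to a pair seen in at least `min_cooc` documents. The names `select_features`, `to_vectors`, and `min_cooc`, as well as the toy documents, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def select_features(docs, min_cooc=2):
    """Keep words that co-occur with another word in at least min_cooc documents.

    Assumption: this threshold rule is illustrative; the paper's actual
    criterion is not stated in the abstract.
    """
    pair_counts = Counter()
    for doc in docs:
        # Count each unordered word pair at most once per document.
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))

    selected = set()
    for (first, second), count in pair_counts.items():
        if count >= min_cooc:
            selected.update((first, second))
    return sorted(selected)


def to_vectors(docs, features):
    """Represent each document as term counts over the reduced feature space."""
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([counts[feature] for feature in features])
    return vectors


# Toy corpus with two obvious topics (hypothetical data).
docs = [
    "stock market price rises",
    "market price falls sharply",
    "team wins football match",
    "football match ends level",
]
features = select_features(docs, min_cooc=2)
vectors = to_vectors(docs, features)
print(features)
print(vectors)
```

On this toy corpus the feature space shrinks from 14 distinct words to the 4 that recur together (`football`, `market`, `match`, `price`), and documents on the same topic receive identical vectors. Any clustering algorithm can then be run on `vectors` in place of the full term vectors.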


Pages: 2963-2968

Page count: 6
