Sampling and feature selection in a genetic algorithm for document clustering

Cited by: 0
Authors
Casillas, A [1]
de Lena, MTG
Martínez, R
Affiliations
[1] Univ Basque Country, Dpt Elect & Elect, E-48080 Bilbao, Spain
[2] Univ Rey Juan Carlos, Dpt Informat Estadist & Telemat, Madrid, Spain
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we describe a Genetic Algorithm (GA) for document clustering that includes a sampling technique to reduce computation time. The algorithm calculates an approximation of the optimum number of clusters, k, and finds the best grouping of the documents into these k clusters. We evaluate the algorithm on sets of documents returned by a search engine in response to a query. Two types of experiment are carried out to determine (1) how the genetic algorithm performs with a sample of documents, and (2) which document features lead to the best clustering according to an external evaluation. On the one hand, our GA with sampling performs the clustering in a time that makes interaction with a search engine viable. On the other hand, our GA approach with documents represented by entities leads to better results than representation by lemmas only.
Pages: 601-612
Page count: 12