A new unsupervised feature selection method for text clustering based on genetic algorithms

Cited by: 0
Authors
Pirooz Shamsinejadbabki
Mohammad Saraee
Affiliation
[1] Isfahan University of Technology, Intelligent Database Systems, Data Mining and Bioinformatics Research Laboratory, Department of Electrical and Computer Engineering
Keywords
Text clustering; Unsupervised feature selection; Genetic algorithm
DOI
Not available
Abstract
A vast amount of textual information is now collected and stored in databases around the world, with the Internet being the largest of all. Published text is growing so quickly that even the most avid reader cannot keep up with all the reading in a field, so nuggets of insight and new knowledge risk languishing undiscovered in the literature. Text mining addresses this problem by replacing or supplementing the human reader with automatic systems that are undeterred by the text explosion; it analyzes a large collection of documents to discover previously unknown information. Text clustering, one of the most important areas of text mining, comprises text preprocessing, dimension reduction by selecting a subset of terms (features), and finally clustering using the selected terms. Feature selection appears to be the most important step in this process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms and use it to select suitable terms from the corpus. To date, however, the evaluation of terms in groups has not been investigated in the reported work. This paper proposes a new, robust unsupervised feature selection approach that evaluates terms in groups. It also proposes a new measure, Modified Term Variance, for evaluating groups of terms, and a genetic algorithm is designed and implemented to find the most valuable groups of terms under this measure. These terms are then used to build the final feature vector for the clustering process. To evaluate the approach, the proposed method and a conventional term variance method are implemented and tested on the Reuters-21578 collection. For a more accurate comparison, both methods are tested on three corpora; for each corpus the clustering task is run ten times and the results are averaged. The comparison is promising: the proposed method yields better average accuracy and F1-measure than the conventional term variance method.
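
The abstract does not give the formulas, so the following is only a rough sketch of the two ingredients it names: a per-term variance score and a genetic search over groups of terms. It assumes the commonly used definition of term variance (the sum of squared deviations of a term's frequency from its mean across documents). The group measure `group_score`, the function `select_terms_ga`, and all parameter values are hypothetical stand-ins chosen for illustration; they are not the paper's Modified Term Variance or its algorithm.

```python
import random


def term_variance(tf_column):
    """Assumed conventional term variance: sum of squared deviations of a
    term's frequency from its mean frequency over all documents."""
    mean = sum(tf_column) / len(tf_column)
    return sum((f - mean) ** 2 for f in tf_column)


def group_score(tf, group):
    """Hypothetical group measure (NOT the paper's Modified Term Variance):
    summed term variance of the group, scaled by the fraction of documents
    that contain at least one term of the group."""
    spread = sum(term_variance([doc[t] for doc in tf]) for t in group)
    coverage = sum(any(doc[t] > 0 for t in group) for doc in tf) / len(tf)
    return spread * coverage


def select_terms_ga(tf, k, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Toy genetic search for a group of k term indices with a high group_score."""
    rng = random.Random(seed)
    n_terms = len(tf[0])

    def fitness(group):
        return group_score(tf, group)

    # Each individual is a set of k distinct term indices.
    population = [frozenset(rng.sample(range(n_terms), k)) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_pop = [best]  # elitism: carry the best group forward unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            # Crossover: draw k terms from the union of both parents.
            child = set(rng.sample(sorted(p1 | p2), k))
            # Mutation: drop one term, then add one term not currently in the group.
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice([t for t in range(n_terms) if t not in child]))
            next_pop.append(frozenset(child))
        population = next_pop
        best = max(population, key=fitness)
    return sorted(best)


# Toy term-frequency matrix: rows are documents, columns are terms.
tf = [
    [2, 0, 1, 5],
    [2, 3, 1, 0],
    [2, 0, 1, 4],
    [2, 3, 1, 0],
]
selected = select_terms_ga(tf, k=2)
print(selected, group_score(tf, selected))
```

In this toy matrix, terms 0 and 2 have the same frequency in every document, so their term variance is zero and they contribute nothing to any group's score. The selected indices would then be used to project each document onto the reduced feature vector before clustering.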
Pages: 669 - 684
Page count: 15
Related papers
50 in total
  • [1] A new unsupervised feature selection method for text clustering based on genetic algorithms
    Shamsinejadbabki, Pirooz
    Saraee, Mohammad
    [J]. JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2012, 38 (03) : 669 - 684
  • [2] Unsupervised Feature Selection Technique Based on Genetic Algorithm for Improving the Text Clustering
    Abualigah, Laith Mohammad
    Khader, Ahamad Tajudin
    Al-Betar, Mohammed Azmi
    [J]. 2016 7TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND INFORMATION TECHNOLOGY (CSIT), 2016,
  • [3] Spectral Clustering Based Unsupervised Feature Selection Algorithms
    Xie, Juan-Ying
    Ding, Li-Juan
    Wang, Ming-Zhao
    [J]. Ruan Jian Xue Bao/Journal of Software, 2020, 31 (04) : 1009 - 1024
  • [4] A New Feature Selection Method for Text Clustering
    Xu, Junling
    [J]. Wuhan University Journal of Natural Sciences, 2007, (05) : 912 - 916
  • [5] FCFilter: Feature Selection based on Clustering and Genetic Algorithms
    Ferreira, Charles H. P.
    de Medeiros, Debora M. R.
    Santana, Fabiana
    [J]. 2016 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2016, : 2106 - 2113
  • [6] Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering
    Abualigah, Laith Mohammad
    Khader, Ahamad Tajudin
    [J]. JOURNAL OF SUPERCOMPUTING, 2017, 73 (11) : 4773 - 4795
  • [7] Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering
    Laith Mohammad Abualigah
    Ahamad Tajudin Khader
    [J]. The Journal of Supercomputing, 2017, 73 : 4773 - 4795
  • [8] A Feature Selection Method Based on Genetic Algorithms
    Jiang, Mingyang
    Fan, Xiaojing
    Zhang, Xinhong
    Jie, Lian
    Zhou, Yuxin
    Wang, QiangHu
    Zhang, ZhiFeng
    Pei, Zhili
    [J]. PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON MECHATRONICS, ELECTRONIC, INDUSTRIAL AND CONTROL ENGINEERING, 2014, 5 : 914 - +
  • [9] A comparative study on unsupervised feature selection methods for text clustering
    Liu, LY
    Kang, JC
    Yu, J
    Wang, ZL
    [J]. PROCEEDINGS OF THE 2005 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (IEEE NLP-KE'05), 2005, : 597 - 601
  • [10] Unsupervised text feature selection by binary fire hawk optimizer for text clustering
    Msallam, Mohammed M.
    Bin Idris, Syahril Anuar
    [J]. CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2024, 27 (06) : 7721 - 7740