Efficient text document clustering with new similarity measures

Cited by: 8
Authors
Lakshmi R. [1]
Baskar S. [2]
Affiliations
[1] Department of Computer Science and Engineering, K.L.N. College of Engineering, Sivagangai District, Tamilnadu
[2] Department of Electrical and Electronics Engineering, Thiagarajar College of Engineering, Madurai, Tamilnadu
Keywords
Accuracy; Document clustering; Entropy; F-measure; Recall; Similarity measures
DOI
10.1504/IJBIDM.2021.111741
Abstract
In this paper, two new similarity measures, namely the distance of term frequency-based similarity measure (DTFSM) and the presence of common terms-based similarity measure (PCTSM), are proposed to compute the similarity between two documents and thereby improve the effectiveness of text document clustering. The proposed measures are evaluated on the Reuters-21578 and WebKB datasets, with documents clustered using the K-means and K-means++ algorithms. The results obtained with DTFSM and PCTSM are significantly better than those of other measures in terms of accuracy, entropy, recall and F-measure. The proposed similarity measures not only improve the effectiveness of text document clustering but also reduce the complexity of the similarity computation, measured by the number of operations required during clustering. Copyright © 2021 Inderscience Enterprises Ltd.
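The abstract names entropy and F-measure as evaluation criteria but does not give their formulas. The sketch below shows one commonly used formulation for scoring a clustering against known class labels: size-weighted per-cluster entropy in bits, and the class-weighted best-match F-measure. This is an illustrative assumption and may differ from the definitions used in the paper; the function names are hypothetical, and DTFSM and PCTSM themselves are not reproduced because the record does not define them.

```python
from collections import Counter
from math import log2


def contingency(labels, clusters):
    """Count documents for each (true class, assigned cluster) pair."""
    return Counter(zip(labels, clusters))


def clustering_entropy(labels, clusters):
    """Size-weighted mean of per-cluster class entropy in bits (lower is better).

    Assumed formulation, not necessarily the one used in the paper.
    """
    n = len(labels)
    pair_counts = contingency(labels, clusters)
    cluster_sizes = Counter(clusters)
    total = 0.0
    for (_, cluster), n_ij in pair_counts.items():
        n_j = cluster_sizes[cluster]
        p = n_ij / n_j  # share of cluster j that belongs to class i
        total -= (n_j / n) * p * log2(p)
    return total


def clustering_f_measure(labels, clusters):
    """Class-weighted best-match F-measure (higher is better).

    For each class, take the best F1 over all clusters, then weight by class size.
    Assumed formulation, not necessarily the one used in the paper.
    """
    n = len(labels)
    pair_counts = contingency(labels, clusters)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    score = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for cluster, n_j in cluster_sizes.items():
            n_ij = pair_counts.get((cls, cluster), 0)
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_i / n) * best
    return score


if __name__ == "__main__":
    # Toy labels and cluster assignments, made up purely for illustration.
    true_labels = ["earn", "earn", "acq", "acq"]
    assigned = [0, 0, 1, 1]
    print(clustering_entropy(true_labels, assigned))
    print(clustering_f_measure(true_labels, assigned))
```

With a formulation like this, a clustering that separates the classes perfectly scores an entropy of 0 and an F-measure of 1, which is why the two criteria are read in opposite directions.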
Pages: 109 - 126
Page count: 17
Related papers
  • [1] Analysis of similarity measures with WordNet based text document clustering
    Sandhya, Nadella
    Govardhan, A.
    Advances in Intelligent and Soft Computing, 2012, 132 AISC : 703 - 714
  • [2] Analysis of Similarity Measures with WordNet Based Text Document Clustering
    Sandhya, Nadella
    Govardhan, A.
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS 2012 (INDIA 2012), 2012, 132 : 703 - +
  • [3] Frequent Term Based Text Document Clustering Using Similarity Measures: A Novel Approach
    Gupta, Vijay Kumar
    Dutta, Maitreyee
    Kumar, Manoj
    2017 FOURTH INTERNATIONAL CONFERENCE ON IMAGE INFORMATION PROCESSING (ICIIP), 2017, : 164 - 169
  • [4] Comparative Analysis of Similarity Measures in Document Clustering
    Karun, Kavitha A.
    Philip, Mintu
    Lubna, K.
    2013 INTERNATIONAL CONFERENCE ON GREEN COMPUTING, COMMUNICATION AND CONSERVATION OF ENERGY (ICGCE), 2013, : 857 - 860
  • [5] An Intelligent Similarity Measure for Effective Text Document Clustering
    Aishwarya, M. L.
    Selvi, K.
    2016 INTERNATIONAL CONFERENCE ON COMPUTING TECHNOLOGIES AND INTELLIGENT DATA ENGINEERING (ICCTIDE'16), 2016,
  • [6] Data clustering using efficient similarity measures
    Bisandu, Desmond Bala
    Prasad, Rajesh
    Liman, Musa Muhammad
    JOURNAL OF STATISTICS AND MANAGEMENT SYSTEMS, 2019, 22 (05) : 901 - 922
  • [7] Efficient phrase-based document similarity for clustering
    Chim, Hung
    Deng, Xiaotie
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2008, 20 (09) : 1217 - 1229
  • [8] An Analysis of Efficient Clustering Methods for Estimates Similarity Measures
    Jagatheeshkumar, G.
    Brunda, S. Selva
    2017 4TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND COMMUNICATION SYSTEMS (ICACCS), 2017,
  • [9] Enhanced Distributed Document Clustering Algorithm Using Different Similarity Measures
    Narayanan, Neethi
    Judith, J. E.
    Jayakumari, J.
    2013 IEEE CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGIES (ICT 2013), 2013, : 545 - 550
  • [10] A New Similarity Measure for Document Classification and Text Mining
    Eminagaoglu, Mete
    Goksen, Yilmaz
    ECONOMIES OF THE BALKAN AND EASTERN EUROPEAN COUNTRIES, 2020, : 353 - 366