Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering

Cited by: 5
Authors
Roy, Arnab Kumar [1]
Basu, Tanmay [2]
Affiliations
[1] LICHESSORG, Loire Valley, Maine & Loire, France
[2] Indian Inst Sci Educ & Res, Dept Data Sci & Engn, Bhopal, India
Keywords
Text clustering; Data clustering; Applied machine learning; Data mining
DOI
10.1007/s10115-022-01658-9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Text clustering partitions a set of text documents into meaningful groups such that the documents in a cluster are more similar to each other than to the documents in other clusters, according to a similarity or dissimilarity measure. The choice of similarity measure is therefore crucial for producing good-quality clusters. Clusters are generally formed using the content similarity between two documents, which is measured from the terms they share. However, content similarity alone may not be effective for a reasonably large and high-dimensional corpus. A similarity measure is therefore proposed here to improve the performance of text clustering with the spectral method. The proposed measure assigns a score to a pair of documents based on their content similarity and on their individual similarities with their shared neighbours over the corpus. Its effectiveness has been tested by clustering different standard corpora with the spectral clustering method. Empirical results on some well-known text collections show that the proposed method performs better than state-of-the-art text clustering techniques in terms of normalized mutual information, F-measure and V-measure.
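The abstract does not give the formula for the proposed measure, so the following is only a minimal sketch of the general idea it describes: a pairwise score that combines content similarity with each document's similarity to the neighbours the pair shares. Everything specific here is an assumption made for illustration and is not the authors' postimpact similarity: the function names, the whitespace tokenisation, the neighbour `threshold`, the min-based neighbour support, and the equal weighting of the two parts.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbour_aware_similarity(docs, threshold=0.1):
    """Hypothetical score: mean of content similarity and shared-neighbour support.

    This is an illustrative stand-in, not the measure defined in the paper.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    content = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

    # Assumption: a neighbour of i is any other document whose content
    # similarity with i reaches the threshold.
    neigh = [
        {k for k in range(n) if k != i and content[i][k] >= threshold}
        for i in range(n)
    ]

    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            shared = neigh[i] & neigh[j]
            # Assumption: a shared neighbour supports the pair only as much
            # as the weaker of its two similarities.
            support = (
                sum(min(content[i][k], content[j][k]) for k in shared) / len(shared)
                if shared
                else 0.0
            )
            sim[i][j] = 0.5 * (content[i][j] + support)
    return sim


docs = [
    "apple banana fruit",
    "banana fruit salad",
    "apple fruit juice",
    "engine car wheel",
    "car wheel road",
]
sim = neighbour_aware_similarity(docs)
```

The resulting symmetric matrix could then be handed to any spectral clustering implementation that accepts a precomputed affinity matrix, for example scikit-learn's `SpectralClustering` with `affinity="precomputed"`.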
Pages: 723-742
Number of pages: 20
Source
Knowledge and Information Systems, 2022, 64: 723-742