Document Similarity Using a Phrase Indexing Graph Model

Cited by: 20
Authors
Hammouda, Khaled M. [1]
Kamel, Mohamed S. [1]
Affiliation
[1] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
Document similarity; Phrase matching; Phrase indexing; Document representation; Document Index Graph;
DOI
10.1007/s10115-003-0118-5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Document clustering techniques mostly rely on single term analysis of text, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured Web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. However, using phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single term weights and matching phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
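The idea of combining single-term and phrase-based similarity can be illustrated with a minimal sketch. This is not the authors' Document Index Graph or their weighting scheme, which the abstract does not specify. It assumes a plain term-frequency cosine for the single-term part, a Jaccard overlap of word bigrams as a crude stand-in for phrase matching, and a hypothetical blending weight `alpha`. All function names are illustrative.

```python
import math
from collections import Counter


def tokens(text):
    # Lowercase whitespace tokenization (assumption: no punctuation handling).
    return text.lower().split()


def term_cosine(a, b):
    # Cosine similarity over raw term-frequency vectors (single-term part).
    ta, tb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bigrams(text):
    # Set of adjacent word pairs, used here as a simple proxy for phrases.
    t = tokens(text)
    return set(zip(t, t[1:]))


def phrase_overlap(a, b):
    # Jaccard overlap of word bigrams (assumed stand-in for phrase matching).
    pa, pb = bigrams(a), bigrams(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def combined_similarity(a, b, alpha=0.5):
    # Hypothetical linear blend: alpha weights the phrase part,
    # (1 - alpha) weights the single-term part.
    return alpha * phrase_overlap(a, b) + (1 - alpha) * term_cosine(a, b)
```

Under these assumptions, "river rafting trips" and "trips rafting river" share every term but no bigram, so the single-term score is about 1.0 while the phrase score is 0. The blend separates them from a true duplicate, which a term-only measure cannot do.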
Pages: 710-727
Number of pages: 18
Related papers
50 in total
  • [11] Creating a Phrase Similarity Graph From Wikipedia
    Stanchev, Lubomir
    2014 IEEE INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC), 2014, : 68 - 75
  • [12] Hybrid distance based document clustering with keyword and phrase indexing
    Subhadra, K.
    Shashi, M.
    International Journal of Computer Science Issues, 2012, 9 (02): : 345 - 350
  • [13] A Novel Document and Query Similarity Indexing using VSM for Unstructured Documents
    Reshma, P. K.
    Rajagopal, Suharshala
    Lajish, V. L.
    2020 6TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND COMMUNICATION SYSTEMS (ICACCS), 2020, : 676 - 681
  • [14] Efficient phrase-based document similarity for clustering
    Chim, Hung
    Deng, Xiaotie
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2008, 20 (09) : 1217 - 1229
  • [15] THE EFFECTIVENESS OF A NONSYNTACTIC APPROACH TO AUTOMATIC PHRASE INDEXING FOR DOCUMENT-RETRIEVAL
    FAGAN, JL
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 1989, 40 (02): : 115 - 132
  • [16] Using fuzzy-word correlation factors to compute document similarity based on phrase matching
    Lee, Jun won
    Ng, Yiu-Kai
    FOURTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2007, : 186 - +
  • [17] Document clustering based on similarity of subjects using integrated subject graph
    Nakada, M
    Osana, Y
    PROCEEDINGS OF THE IASTED INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND APPLICATIONS, 2006, : 410 - +
  • [18] Clustering Blogs Using Document Context Similarity and Spectral Graph Partitioning
    Ayyasamy, Ramesh Kumar
    Alhashmi, Saadat M.
    Eu-Gene, Siew
    Tahayna, Bashar
    KNOWLEDGE ENGINEERING AND MANAGEMENT, 2011, 123 : 475 - +
  • [19] Geometric Graph Indexing for Similarity Search in Scientific Databases
    Armiti, Ayser
    Gertz, Michael
    28TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT (SSDBM 2016), 2016,
  • [20] LINKED PHRASE INDEXING
    CRAVEN, TC
    INFORMATION PROCESSING & MANAGEMENT, 1978, 14 (06) : 469 - 476