Document Similarity Using a Phrase Indexing Graph Model

Cited by: 20
Authors
Hammouda, Khaled M. [1]
Kamel, Mohamed S. [1]
Affiliations
[1] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Document similarity; Phrase matching; Phrase indexing; Document representation; Document Index Graph;
DOI
10.1007/s10115-003-0118-5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Document clustering techniques mostly rely on single-term analysis of text, as in the vector space model. To better capture the structure of documents, the underlying data model should represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured nature of Web documents helps identify potential phrases that, when matched in other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and with such a model, finding significant matching phrases between documents becomes easy and efficient. The model is flexible in that it can revert to a compact representation of the vector space model if we choose not to index phrases. However, phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
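The idea in the abstract can be illustrated with a small sketch. This is not the authors' Document Index Graph implementation; it is a toy approximation under stated assumptions: words are nodes, consecutive word pairs are edges annotated with the documents and positions where they occur, phrases are found by a greedy left-to-right scan, and the two similarities are blended with a hypothetical factor `alpha`. The class name `PhraseIndexGraph`, the scoring formulas, and the sample sentences are all illustrative choices, not taken from the paper.

```python
from collections import Counter, defaultdict
from math import sqrt


class PhraseIndexGraph:
    """Toy phrase index (hypothetical): words are nodes, word pairs are edges."""

    def __init__(self):
        # (word, next_word) -> doc_id -> set of (sentence_no, position)
        self.edges = defaultdict(lambda: defaultdict(set))
        self.sentences = {}  # doc_id -> list of tokenized sentences
        self.term_freq = {}  # doc_id -> Counter of single terms

    def add_document(self, doc_id, sentences):
        tokenized = [s.lower().split() for s in sentences]
        self.sentences[doc_id] = tokenized
        self.term_freq[doc_id] = Counter(w for ws in tokenized for w in ws)
        for s_no, words in enumerate(tokenized):
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((s_no, pos))

    def _hits(self, w1, w2, doc_id):
        # Occurrences of the edge w1 -> w2 inside doc_id (no side effects).
        by_doc = self.edges.get((w1, w2))
        return by_doc.get(doc_id, set()) if by_doc else set()

    def matching_phrases(self, doc_a, doc_b):
        """Greedy scan of doc_a for word runs (length >= 2) also found in doc_b."""
        phrases = set()
        for words in self.sentences[doc_a]:
            i = 0
            while i < len(words) - 1:
                hits = self._hits(words[i], words[i + 1], doc_b)
                if not hits:
                    i += 1
                    continue
                j = i + 1
                # Extend while doc_b continues the same path in the same sentence.
                while j < len(words) - 1:
                    nxt = self._hits(words[j], words[j + 1], doc_b)
                    cont = {(s, p + 1) for s, p in hits} & nxt
                    if not cont:
                        break
                    hits, j = cont, j + 1
                phrases.add(" ".join(words[i:j + 1]))
                i = j + 1
        return phrases

    def _length(self, doc_id):
        return sum(self.term_freq[doc_id].values())

    def phrase_similarity(self, doc_a, doc_b):
        # Assumed toy score: matched phrase words over the longer document length.
        shared = self.matching_phrases(doc_a, doc_b)
        matched_words = sum(len(p.split()) for p in shared)
        longest = max(self._length(doc_a), self._length(doc_b))
        return matched_words / longest if longest else 0.0

    def term_similarity(self, doc_a, doc_b):
        # Cosine similarity over raw term frequencies (vector space model).
        fa, fb = self.term_freq[doc_a], self.term_freq[doc_b]
        dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
        norm = sqrt(sum(v * v for v in fa.values())) * sqrt(sum(v * v for v in fb.values()))
        return dot / norm if norm else 0.0

    def similarity(self, doc_a, doc_b, alpha=0.5):
        # alpha is a hypothetical blend factor; alpha=0 ignores phrases entirely.
        return (alpha * self.phrase_similarity(doc_a, doc_b)
                + (1 - alpha) * self.term_similarity(doc_a, doc_b))


graph = PhraseIndexGraph()
graph.add_document("d1", ["river rafting trips", "wild river adventures"])
graph.add_document("d2", ["river rafting vacation plan", "mild river rafting"])
print(graph.matching_phrases("d1", "d2"))
print(round(graph.similarity("d1", "d2"), 3))
```

Setting `alpha=0` reduces the score to plain term-frequency cosine similarity, which mirrors the abstract's remark that the model can fall back to a vector space representation when phrases are not indexed. The greedy scan is a simplification: it reports the longest match starting at each position and can miss overlapping phrases, and the phrase score depends on which document is scanned.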
Pages: 710-727
Page count: 18
Source
Knowledge and Information Systems, 2004, 6: 710-727