Document Similarity Using a Phrase Indexing Graph Model

Cited by: 20
Authors
Hammouda, Khaled M. [1]
Kamel, Mohamed S. [1]
Affiliation
[1] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Document similarity; Phrase matching; Phrase indexing; Document representation; Document Index Graph;
DOI
10.1007/s10115-003-0118-5
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Document clustering techniques mostly rely on single-term analysis of text, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured nature of Web documents helps in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The model is flexible in that it can revert to a compact representation of the vector space model if we choose not to index phrases. However, using phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
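To make the idea in the abstract concrete, the following is a minimal Python sketch of a phrase-indexing graph and a blended similarity score. It is not the authors' Document Index Graph or their similarity formula, neither of which is given in this record. The class `PhraseIndexGraph`, the functions `phrase_similarity`, `term_similarity`, and `combined_similarity`, the length-based phrase score, and the blending weight `alpha` are all assumptions introduced here for illustration.

```python
from collections import Counter, defaultdict
import math


class PhraseIndexGraph:
    """Toy phrase index (hypothetical): nodes are words, edges are adjacent word pairs."""

    def __init__(self):
        # (w1, w2) -> doc_id -> list of (sentence index, position of w1)
        self.edges = defaultdict(lambda: defaultdict(list))
        self.docs = {}  # doc_id -> list of tokenized sentences

    def add_document(self, doc_id, sentences):
        tokenized = [s.lower().split() for s in sentences]
        self.docs[doc_id] = tokenized
        for s_idx, words in enumerate(tokenized):
            for pos, pair in enumerate(zip(words, words[1:])):
                self.edges[pair][doc_id].append((s_idx, pos))

    def shared_phrases(self, a, b):
        """Contiguous word runs (length >= 2) of doc a that also occur in doc b."""
        phrases = set()
        for words in self.docs[a]:
            # (sentence, edge position) in b -> start index in a of the running match
            active = {}
            for i in range(len(words) - 1):
                hits = self.edges.get((words[i], words[i + 1]), {}).get(b, ())
                # A hit extends a running match only if b's previous edge matched too.
                nxt = {(s, p): active.get((s, p - 1), i) for s, p in hits}
                # Matches that could not be extended end at word i.
                for (s, p), start in active.items():
                    if (s, p + 1) not in nxt:
                        phrases.add(tuple(words[start:i + 1]))
                active = nxt
            for start in active.values():
                phrases.add(tuple(words[start:]))
        return phrases


def phrase_similarity(graph, a, b):
    # Assumed score: matched phrase length relative to the longer document.
    matched = sum(len(p) for p in graph.shared_phrases(a, b))
    longest = max(sum(len(s) for s in graph.docs[d]) for d in (a, b))
    return min(1.0, matched / longest) if longest else 0.0


def term_similarity(graph, a, b):
    # Cosine similarity over raw single-term counts (vector space model).
    ta = Counter(w for s in graph.docs[a] for w in s)
    tb = Counter(w for s in graph.docs[b] for w in s)
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(
        sum(v * v for v in tb.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(graph, a, b, alpha=0.5):
    # alpha is a hypothetical blending weight between phrase and term similarity.
    return alpha * phrase_similarity(graph, a, b) + (1 - alpha) * term_similarity(graph, a, b)


graph = PhraseIndexGraph()
graph.add_document("d1", ["river rafting trips", "wild river adventures"])
graph.add_document("d2", ["river rafting vacation plan", "wild river adventures"])
print(graph.shared_phrases("d1", "d2"))
print(combined_similarity(graph, "d1", "d2"))
```

Setting `alpha=0` in this sketch ignores phrases entirely and leaves only the single-term cosine score, which loosely mirrors the abstract's remark that the model can fall back to a vector space representation when phrases are not indexed.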
Pages: 710-727
Number of pages: 18
Related papers
50 in total
  • [31] Similarity Search in Graph Databases: A Multi-layered Indexing Approach
    Liang, Yongjiang
    Zhao, Peixiang
    2017 IEEE 33RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2017), 2017, : 783 - 794
  • [32] Exploiting local similarity for indexing paths in graph-structured data
    Kaushik, R
    Shenoy, P
    Bohannon, P
    Gudes, E
    18TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2002, : 129 - 140
  • [33] Model transformation verification using similarity and graph comparison algorithm
    Jong-Won Ko
    Kyung-Yong Chung
    Jung-Soo Han
    Multimedia Tools and Applications, 2015, 74 : 8907 - 8920
  • [34] Model transformation verification using similarity and graph comparison algorithm
    Ko, Jong-Won
    Chung, Kyung-Yong
    Han, Jung-Soo
    MULTIMEDIA TOOLS AND APPLICATIONS, 2015, 74 (20) : 8907 - 8920
  • [35] Using noun phrase heads to extract document keyphrases
    Barker, K
    Cornacchia, N
    ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2000, 1822 : 40 - 52
  • [36] Document Searching Engine Using Term Similarity Vector Space Model on English and Indonesian Document
    Handojo, Andreas
    Wibowo, Adi
    Ria, Yovita
    INTELLIGENCE IN THE ERA OF BIG DATA, ICSIIT 2015, 2015, 516 : 165 - 173
  • [37] Document clustering using locality preserving indexing
    Cai, D
    He, XF
    Han, JW
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (12) : 1624 - 1637
  • [38] Graph Based Automatic Document Summarization with Different Similarity Methods
    Kaynar, Oguz
    Isik, Yunus Emre
    Gormez, Yasin
    2017 25TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2017,
  • [39] Exploiting model similarity for indexing and matching to a large model database
    Tan, Yi
    Matei, Bogdan C.
    Sawliney, Harpreet
    COMPUTER VISION - ECCV 2006, PT 2, PROCEEDINGS, 2006, 3952 : 536 - 548
  • [40] Learning heterogeneous graph embedding for Chinese legal document similarity
    Bi, Sheng
    Ali, Zafar
    Wang, Meng
    Wu, Tianxing
    Qi, Guilin
    KNOWLEDGE-BASED SYSTEMS, 2022, 250