Document Similarity Using a Phrase Indexing Graph Model

Cited by: 20
Authors
Hammouda, Khaled M. [1 ]
Kamel, Mohamed S. [1 ]
Affiliation
[1] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Document similarity; Phrase matching; Phrase indexing; Document representation; Document Index Graph;
DOI
10.1007/s10115-003-0118-5
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Document clustering techniques mostly rely on single-term analysis of text, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured nature of Web documents helps identify potential phrases that, when matched in other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, making it easy and efficient to find significant matching phrases between documents. The model is flexible in that it can revert to a compact representation of the vector space model if we choose not to index phrases. However, phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
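To make the idea in the abstract concrete, the following is a minimal Python sketch of a phrase index in the spirit of the Document Index Graph. It is an illustration under stated assumptions, not the authors' implementation: the class `PhraseIndexGraph`, its method names, the coverage-based phrase score, the raw-count cosine term score, and the blend weight `alpha` are all hypothetical choices made here. The abstract only states that words and phrases are indexed together and that phrase and single-term similarities are combined.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


class PhraseIndexGraph:
    """Toy phrase index: words are nodes, consecutive word pairs are edges."""

    def __init__(self):
        # edge (word, next_word) -> doc_id -> set of (sentence_no, position)
        self.edges = defaultdict(lambda: defaultdict(set))
        # doc_id -> list of tokenized sentences
        self.sentences = {}

    def add_document(self, doc_id, sentences):
        self.sentences[doc_id] = [tokenize(s) for s in sentences]
        for s_no, words in enumerate(self.sentences[doc_id]):
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((s_no, pos))

    def _matches(self, doc_a, doc_b):
        """Yield (sent_a, pos_a, sent_b, pos_b, length) per maximal aligned run."""
        for docs in self.edges.values():
            if doc_a not in docs or doc_b not in docs:
                continue  # this edge is not shared by both documents
            for sa, pa in docs[doc_a]:
                for sb, pb in docs[doc_b]:
                    wa = self.sentences[doc_a][sa]
                    wb = self.sentences[doc_b][sb]
                    # Only start at the left end of an aligned run.
                    if pa > 0 and pb > 0 and wa[pa - 1] == wb[pb - 1]:
                        continue
                    n = 0
                    while (pa + n < len(wa) and pb + n < len(wb)
                           and wa[pa + n] == wb[pb + n]):
                        n += 1
                    yield sa, pa, sb, pb, n

    def matching_phrases(self, doc_a, doc_b):
        """Set of shared phrases (two or more words) as word tuples."""
        return {tuple(self.sentences[doc_a][sa][pa:pa + n])
                for sa, pa, _, _, n in self._matches(doc_a, doc_b)}

    def phrase_similarity(self, doc_a, doc_b):
        """Assumed score: fraction of word positions covered by shared phrases."""
        covered_a, covered_b = set(), set()
        for sa, pa, sb, pb, n in self._matches(doc_a, doc_b):
            covered_a.update((sa, pa + k) for k in range(n))
            covered_b.update((sb, pb + k) for k in range(n))
        total = sum(len(s) for d in (doc_a, doc_b) for s in self.sentences[d])
        return (len(covered_a) + len(covered_b)) / total if total else 0.0

    def term_similarity(self, doc_a, doc_b):
        """Cosine similarity over raw term counts (no tf-idf weighting)."""
        va = Counter(w for s in self.sentences[doc_a] for w in s)
        vb = Counter(w for s in self.sentences[doc_b] for w in s)
        dot = sum(va[t] * vb[t] for t in va)
        norm = (math.sqrt(sum(c * c for c in va.values()))
                * math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0

    def similarity(self, doc_a, doc_b, alpha=0.7):
        """Blend phrase and term similarity; alpha is a hypothetical weight."""
        return (alpha * self.phrase_similarity(doc_a, doc_b)
                + (1 - alpha) * self.term_similarity(doc_a, doc_b))


graph = PhraseIndexGraph()
graph.add_document("d1", ["River rafting is fun", "Wild river adventures"])
graph.add_document("d2", ["River rafting trips", "Wild river rafting is fun"])
graph.add_document("d3", ["Stock market news today"])
print(sorted(graph.matching_phrases("d1", "d2")))
print(graph.similarity("d1", "d2"), graph.similarity("d1", "d3"))
```

Setting `alpha=0` in this sketch ignores phrases entirely and falls back to a plain term-vector comparison, which mirrors the abstract's remark that the model can revert to a vector space representation when phrases are not indexed.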
Pages: 710-727
Page count: 18