Efficient document similarity detection using weighted phrase indexing

Authors
Niyigena P. [1]
Zuping Z. [1]
Khuhro M.A. [1]
Hanyurwimfura D. [2]
Affiliations
[1] School of Information Science and Engineering, Central South University, Changsha
[2] College of Science and Technology, University of Rwanda, Kigali
Source
Science and Engineering Research Support Society, 2016, 11 (5)
Funding
Specialized Research Fund for the Doctoral Program of Higher Education; National Natural Science Foundation of China
Keywords
Document similarity algorithm; Efficiency; Pairwise similarity; Phrase indexing
DOI
10.14257/ijmue.2016.11.5.21
Abstract
Document similarity techniques mostly rely on single-term analysis of the documents in a data set. To improve the efficiency and effectiveness of document similarity detection, many researchers have developed more informative feature terms. In this paper, we present a phrase weight index, which indexes the documents in the data set based on important phrases. Phrasal indexing aims to reduce the ambiguity inherent in words considered in isolation and thereby improve the effectiveness of document similarity computation. The presented method inherits the term tf-idf weighting scheme to compute important phrases in the collection: it computes the weight of each phrase in the document collection and, according to a given threshold, identifies and indexes the important phrases. The high data dimensionality that hinders the performance of document similarity methods is addressed by creating an offline index of important phrases for every document. Evaluation experiments indicate that the presented method is very effective for document similarity detection and that its quality surpasses both the traditional phrase-based approach, which ignores dimensionality reduction, and methods that use single-word tf-idf. © 2016 SERSC.
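The abstract outlines a pipeline (weight phrases with a tf-idf style scheme, keep those above a threshold, index them offline, then compare documents) without giving formulas. The following Python sketch illustrates that general idea only; it is not the authors' implementation. Several choices are assumptions made for illustration: phrases are taken to be word bigrams, tf is the relative frequency within a document, idf is log(N / df), and pairwise similarity is cosine similarity over the retained phrase weights. The function names `phrases`, `build_phrase_index`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def phrases(text, n=2):
    """Split text into contiguous word n-grams (an assumed stand-in for 'phrases')."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs, threshold=0.0, n=2):
    """Offline step: per document, keep only phrases whose tf-idf weight exceeds threshold."""
    counts = [Counter(phrases(d, n)) for d in docs]
    # Document frequency: number of documents containing each phrase.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    num_docs = len(docs)
    index = []
    for c in counts:
        total = sum(c.values()) or 1
        weights = {}
        for phrase, tf in c.items():
            # Assumed weighting: relative term frequency times log(N / df).
            w = (tf / total) * math.log(num_docs / df[phrase])
            if w > threshold:
                weights[phrase] = w
        index.append(weights)
    return index


def cosine(a, b):
    """Cosine similarity between two sparse phrase-weight dictionaries."""
    if not a or not b:
        return 0.0
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


docs = [
    "the quick brown fox jumps",
    "the quick brown dog sleeps",
    "stock markets fell sharply today",
]
index = build_phrase_index(docs, threshold=0.0)
sim_related = cosine(index[0], index[1])
sim_unrelated = cosine(index[0], index[2])
```

Raising `threshold` shrinks each document's phrase dictionary, which is one way to read the dimensionality reduction the abstract mentions: similarity is then computed over fewer, higher-weight phrases.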
Pages: 231-244
Number of pages: 13