Efficient document similarity detection using weighted phrase indexing

Authors
Niyigena P. [1]
Zuping Z. [1]
Khuhro M.A. [1]
Hanyurwimfura D. [2]
Affiliations
[1] School of Information Science and Engineering, Central South University, Changsha
[2] College of Science and Technology, University of Rwanda, Kigali
Source
2016 / Science and Engineering Research Support Society / No. 11
Funding
Specialized Research Fund for the Doctoral Program of Higher Education; National Natural Science Foundation of China
Keywords
Document similarity algorithm; Efficiency; Pairwise similarity; Phrase indexing
DOI
10.14257/ijmue.2016.11.5.21
Abstract
Document similarity techniques mostly rely on single-term analysis of the documents in a data set. To improve the efficiency and effectiveness of document similarity detection, many researchers have developed and presented more informative feature terms. In this paper, we present the phrase weight index, which indexes the documents in a data set based on their important phrases. Phrasal indexing aims to reduce the ambiguity inherent in words considered in isolation and thereby improve the effectiveness of document similarity computation. The presented method inherits the term tf-idf weighting scheme to compute the important phrases in the collection: it computes the weight of each phrase in the document collection, and the phrases selected according to a given threshold are identified as important and indexed. The high data dimensionality that hinders the performance of many document similarity methods is addressed by creating an offline index of the important phrases for every document. The evaluation experiments indicate that the presented method is very effective for document similarity detection and that its quality surpasses both the traditional phrase-based approach, which ignores dimensionality reduction, and methods that use single-word tf-idf. © 2016 SERSC.
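The pipeline described in the abstract (weight phrases with tf-idf, keep those above a threshold, index them per document) can be illustrated with a minimal Python sketch. It is not the authors' implementation: word n-grams stand in for phrase extraction, cosine similarity is an assumed comparison measure that the abstract does not name, and the function names `phrases`, `build_phrase_index`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def phrases(text, n=2):
    """Split text into lowercase word n-grams (an assumed stand-in for phrases)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs, threshold=0.0, n=2):
    """Return one {phrase: tf-idf weight} dict per document.

    Only phrases whose weight is strictly greater than `threshold` are kept,
    which is how this sketch reduces dimensionality before comparison.
    """
    phrase_lists = [phrases(doc, n) for doc in docs]

    # Document frequency: in how many documents each phrase occurs.
    doc_freq = Counter()
    for phrase_list in phrase_lists:
        doc_freq.update(set(phrase_list))

    num_docs = len(docs)
    index = []
    for phrase_list in phrase_lists:
        term_freq = Counter(phrase_list)
        total = len(phrase_list) or 1  # avoid division by zero for short docs
        weights = {}
        for phrase, count in term_freq.items():
            weight = (count / total) * math.log(num_docs / doc_freq[phrase])
            if weight > threshold:
                weights[phrase] = weight
        index.append(weights)
    return index


def cosine(a, b):
    """Cosine similarity between two sparse {phrase: weight} vectors."""
    dot = sum(weight * b[phrase] for phrase, weight in a.items() if phrase in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
    "a totally different sentence here",
]
index = build_phrase_index(docs)  # built once, offline
print(cosine(index[0], index[1]))  # documents sharing phrases score above zero
print(cosine(index[0], index[2]))  # no shared phrases
```

Because the index is built once and stores only the phrases above the threshold, each pairwise comparison touches a small sparse vector rather than the full phrase vocabulary. Raising `threshold` shrinks the index further at the cost of discarding lower-weight phrases.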
Pages: 231-244
Number of pages: 13