Efficient document similarity detection using weighted phrase indexing

Authors
Niyigena P. [1]
Zuping Z. [1]
Khuhro M.A. [1]
Hanyurwimfura D. [2]
Affiliations
[1] School of Information Science and Engineering, Central South University, Changsha
[2] College of Science and Technology, University of Rwanda, Kigali
Source
2016 / Science and Engineering Research Support Society / No. 11
Funding
Specialized Research Fund for the Doctoral Program of Higher Education; National Natural Science Foundation of China
Keywords
Document similarity algorithm; Efficiency; Pairwise similarity; Phrase indexing
DOI
10.14257/ijmue.2016.11.5.21
Abstract
Document similarity techniques mostly rely on single-term analysis of the documents in a data set. To improve the efficiency and effectiveness of document similarity detection, many researchers have developed more informative feature terms. In this paper, we present the phrase weight index, which indexes the documents in a data set by their important phrases. Phrasal indexing aims to reduce the ambiguity inherent in words considered in isolation and thereby improve the effectiveness of document similarity computation. The presented method inherits the term tf-idf weighting scheme to compute the important phrases in the collection: it computes the weight of each phrase in the document collection and, according to a given threshold, identifies and indexes the important phrases. The data dimensionality that hinders the performance of many document similarity methods is addressed by creating, offline, an index of important phrases for every document. The evaluation experiments indicate that the presented method is very effective for document similarity detection and that its quality surpasses both the traditional phrase-based approach, which ignores dimensionality reduction, and methods that use single-word tf-idf. © 2016 SERSC.
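
The pipeline outlined in the abstract (weight phrases with tf-idf, keep those above a threshold, index them offline, then compare documents) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how phrases are extracted or which similarity measure is used, so word bigrams and cosine similarity are assumed here, and the names `extract_phrases`, `build_phrase_index`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def extract_phrases(text, n=2):
    """Return word n-grams as candidate phrases (assumption: bigrams)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs, threshold=0.0, n=2):
    """Offline step: tf-idf weight each phrase and keep those above threshold.

    Returns one {phrase: weight} dict per document.
    """
    phrase_counts = [Counter(extract_phrases(d, n)) for d in docs]

    # Document frequency: in how many documents each phrase occurs.
    df = Counter()
    for counts in phrase_counts:
        df.update(counts.keys())

    num_docs = len(docs)
    index = []
    for counts in phrase_counts:
        total = sum(counts.values()) or 1
        weights = {}
        for phrase, count in counts.items():
            weight = (count / total) * math.log(num_docs / df[phrase])
            if weight > threshold:  # only "important" phrases are indexed
                weights[phrase] = weight
        index.append(weights)
    return index


def cosine(a, b):
    """Cosine similarity of two sparse phrase-weight vectors (assumed measure)."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
    "stock markets fell sharply today",
]
index = build_phrase_index(docs, threshold=0.0)
sim_near = cosine(index[0], index[1])  # documents sharing several phrases
sim_far = cosine(index[0], index[2])   # documents sharing no phrase
```

Raising `threshold` shrinks each document's vector, which is the dimensionality reduction the abstract refers to; the right value would have to be tuned on the collection at hand.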
Pages: 231-244
Page count: 13