Efficient document similarity detection using weighted phrase indexing

Authors
Niyigena P. [1]
Zuping Z. [1]
Khuhro M.A. [1]
Hanyurwimfura D. [2]
Affiliations
[1] School of Information Science and Engineering, Central South University, Changsha
[2] College of Science and Technology, University of Rwanda, Kigali
Source
Science and Engineering Research Support Society (SERSC), 2016, Vol. 11, No. 5
Funding
Specialized Research Fund for the Doctoral Program of Higher Education; National Natural Science Foundation of China
Keywords
Document similarity algorithm; Efficiency; Pairwise similarity; Phrase indexing
DOI
10.14257/ijmue.2016.11.5.21
Abstract
Document similarity techniques mostly rely on single-term analysis of the documents in a data set. To improve the efficiency and effectiveness of document similarity detection, many researchers have developed more informative feature terms. In this paper, we present the phrase weight index, which indexes the documents in a data set by their important phrases. Phrasal indexing aims to reduce the ambiguity inherent in words considered in isolation and thereby improve the effectiveness of document similarity computation. The method inherits the term-level tf-idf weighting scheme to compute important phrases in the collection. It computes the weight of each phrase in the document collection; phrases are then identified as important according to a given threshold and indexed. High data dimensionality, which hinders the performance of various document similarity methods, is addressed by creating an offline index of important phrases for every document. The evaluation experiments indicate that the presented method is very effective for document similarity detection, and that its quality surpasses both the traditional phrase-based approach, which ignores dimensionality reduction, and methods that use single-word tf-idf. © 2016 SERSC.
Pages: 231-244
Number of pages: 13
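
Illustrative sketch
The abstract describes weighting phrases with a tf-idf scheme, keeping only those selected by a threshold, and indexing them offline per document. The Python sketch below is one possible reading of that idea, not the authors' implementation. Several details are assumptions that do not come from the record: phrases are approximated by word bigrams, the weight is relative frequency times log(N/df), a phrase is kept when its weight is at least the threshold, and documents are compared with cosine similarity. The names `extract_phrases`, `build_phrase_index`, and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def extract_phrases(text, n=2):
    """Return lowercase word n-grams (assumed stand-in for phrase extraction)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs, threshold=0.05, n=2):
    """Offline step: per document, keep phrases whose tf-idf weight >= threshold.

    The weighting formula and the threshold value are assumptions for
    illustration only.
    """
    counts = [Counter(extract_phrases(doc, n)) for doc in docs]
    # Document frequency: number of documents containing each phrase.
    doc_freq = Counter(phrase for c in counts for phrase in c)
    num_docs = len(docs)

    index = []
    for c in counts:
        total = sum(c.values())
        weights = {}
        for phrase, tf in c.items():
            weight = (tf / total) * math.log(num_docs / doc_freq[phrase])
            if weight >= threshold:
                weights[phrase] = weight
        index.append(weights)
    return index


def cosine_similarity(a, b):
    """Cosine similarity between two sparse phrase-weight dictionaries."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the index stores only the phrases that pass the threshold, pairwise comparison touches a small dictionary per document rather than the full phrase vocabulary, which is the dimensionality reduction the abstract refers to. Raising the threshold shrinks the index further at the cost of discarding phrases shared across documents.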