Efficient document similarity detection using weighted phrase indexing

Cited by: 0

Authors:

Niyigena P. [1]

Zuping Z. [1]

Khuhro M.A. [1]

Hanyurwimfura D. [2]

Affiliations:

[1] School of Information Science and Engineering, Central South University, Changsha

[2] College of Science and Technology, University of Rwanda, Kigali

Source:

Science and Engineering Research Support Society, 2016, 11

Funding:

Specialized Research Fund for the Doctoral Program of Higher Education; National Natural Science Foundation of China;

Keywords:

Document similarity algorithm; Efficiency; Pairwise similarity; Phrase indexing;

DOI:

10.14257/ijmue.2016.11.5.21

Abstract:

Document similarity techniques mostly rely on single-term analysis of the documents in a data set. To improve the efficiency and effectiveness of document similarity detection, many researchers have developed more informative feature terms. In this paper, we present the phrase weight index, which indexes the documents in the data set by their important phrases. Phrasal indexing aims to reduce the ambiguity inherent in words considered in isolation and thereby improve the effectiveness of document similarity computation. The presented method inherits the term tf-idf weighting scheme to compute the important phrases in the collection: it computes the weight of each phrase in the document collection, and the phrases selected according to a given threshold are identified as important and indexed. The high data dimensionality that hinders the performance of many document similarity methods is addressed by creating, offline, an index of the important phrases of every document. The evaluation experiments indicate that the presented method is very effective for document similarity detection and that its quality surpasses both the traditional phrase-based approach, which ignores dimensionality reduction, and methods that use single-word tf-idf. © 2016 SERSC.
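
The pipeline summarized in the abstract (weight phrases with tf-idf, keep those above a threshold, index them per document, then compare documents) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record gives no formulas, so the sketch assumes that a "phrase" is a word bigram, that the weight is the standard tf * log(N / df), and that documents are compared by cosine similarity. The names `phrases`, `build_phrase_index`, `cosine`, and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def phrases(text, n=2):
    """Split text into lowercase word n-grams (an assumed stand-in for phrases)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs, threshold=0.0, n=2):
    """Offline step: per document, keep only phrases whose tf-idf weight exceeds threshold."""
    counts = [Counter(phrases(d, n)) for d in docs]
    # Document frequency: in how many documents each phrase occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    num_docs = len(docs)
    index = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for very short documents
        weights = {}
        for phrase, tf in c.items():
            w = (tf / total) * math.log(num_docs / df[phrase])
            if w > threshold:  # assumed selection rule: strictly above the threshold
                weights[phrase] = w
        index.append(weights)
    return index


def cosine(a, b):
    """Cosine similarity between two sparse phrase-weight dictionaries."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
    "a slow green turtle swims",
]
index = build_phrase_index(docs)
related = cosine(index[0], index[1])    # the two documents share three bigrams
unrelated = cosine(index[0], index[2])  # the two documents share no bigram
```

Raising `threshold` shrinks each document's entry in the index, which is the dimensionality reduction the abstract refers to; the trade-off is that phrases shared by many documents are the first to be dropped.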

Pages: 231-244

Page count: 13
