Efficient Exact Similarity Searches using Multiple Token Orderings

Cited by: 12

Authors:

Kim, Jongik [1]

Lee, Hongrae [2]

Affiliations:

[1] Chonbuk Natl Univ, Div Comp Sci & Engn, 567 Baekje Daero, Jeonju, South Korea

[2] Google Inc, Mountain View, CA 94043 USA

Source:

2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE) | 2012

Funding:

National Research Foundation, Singapore;

Keywords:

DOI:

10.1109/ICDE.2012.79

Chinese Library Classification number:

TP301 [Theory, Methods];

Subject classification code:

081202;

Abstract:

Similarity searches are essential in many applications including data cleaning and near duplicate detection. Many similarity search algorithms first generate candidate records, and then identify true matches among them. A major focus of those algorithms has been on how to reduce the number of candidate records in the early stage of similarity query processing. One of the most commonly used techniques to reduce the candidate size is the prefix filtering principle, which exploits the document frequency ordering of tokens. In this paper, we propose a novel partitioning technique that considers multiple token orderings based on token co-occurrence statistics. Experimental results show that the proposed technique is effective in reducing the number of candidate records and as a result improves the performance of existing algorithms significantly.
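
As an illustration of the prefix filtering principle mentioned in the abstract, the sketch below shows the conventional single-ordering baseline, not the multiple-ordering partitioning technique the paper proposes. It assumes Jaccard similarity over token sets and the commonly used prefix length |x| - ceil(t * |x|) + 1 for a threshold t. The names `PrefixFilterIndex`, `prefix_length`, and `jaccard` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_length(size, threshold):
    """Prefix length for a Jaccard threshold (assumed standard formula).

    The small epsilon guards against floating-point round-up; it can only
    make the prefix longer, which never causes a missed match.
    """
    return size - math.ceil(threshold * size - 1e-9) + 1


class PrefixFilterIndex:
    """Inverted index over record prefixes (hypothetical helper class)."""

    def __init__(self, records, threshold):
        self.threshold = threshold
        self.records = [set(r) for r in records]
        # Document frequency of each token defines the global token ordering.
        self.df = Counter(tok for rec in self.records for tok in rec)
        self.index = defaultdict(set)
        for rid, rec in enumerate(self.records):
            for tok in self._prefix(rec):
                self.index[tok].add(rid)

    def _prefix(self, tokens):
        # Rarest tokens first; ties are broken by the token itself so that
        # every record and query is sorted under the same total order.
        ordered = sorted(set(tokens), key=lambda t: (self.df.get(t, 0), t))
        return ordered[:prefix_length(len(ordered), self.threshold)]

    def candidates(self, query):
        """Records that share at least one prefix token with the query."""
        found = set()
        for tok in self._prefix(query):
            found |= self.index.get(tok, set())
        return found

    def search(self, query):
        """Verify candidates and return the ids of the true matches."""
        return sorted(
            rid for rid in self.candidates(query)
            if jaccard(query, self.records[rid]) >= self.threshold
        )


records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
idx = PrefixFilterIndex(records, threshold=0.6)
print(idx.search(["a", "b", "c", "d"]))
```

In this small example the third record shares no prefix token with the query, so it is never verified. Candidate generation followed by verification is the two-stage structure the abstract describes.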


Pages: 822-833

Page count: 12
