Efficient Exact Similarity Searches using Multiple Token Orderings

Cited by: 12
Authors
Kim, Jongik [1]
Lee, Hongrae [2]
Affiliations
[1] Chonbuk Natl Univ, Div Comp Sci & Engn, 567 Baekje Daero, Jeonju, South Korea
[2] Google Inc, Mountain View, CA 94043 USA
Funding
National Research Foundation, Singapore
DOI
10.1109/ICDE.2012.79
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Similarity searches are essential in many applications including data cleaning and near duplicate detection. Many similarity search algorithms first generate candidate records, and then identify true matches among them. A major focus of those algorithms has been on how to reduce the number of candidate records in the early stage of similarity query processing. One of the most commonly used techniques to reduce the candidate size is the prefix filtering principle, which exploits the document frequency ordering of tokens. In this paper, we propose a novel partitioning technique that considers multiple token orderings based on token co-occurrence statistics. Experimental results show that the proposed technique is effective in reducing the number of candidate records and as a result improves the performance of existing algorithms significantly.
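For readers unfamiliar with the prefix filtering principle mentioned in the abstract, the following is a minimal sketch of the standard technique with a single document frequency ordering. It is not the partitioning technique proposed in the paper. It assumes Jaccard similarity over token sets, and the function names (`prefix_length`, `build_index`, `search`) are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_length(size, threshold):
    """Prefix length for a Jaccard threshold: size - ceil(t * size) + 1.

    The small epsilon guards against floating-point error pushing the
    ceiling one step too high, which would shorten the prefix unsafely.
    """
    return size - math.ceil(threshold * size - 1e-9) + 1


def build_index(records, threshold):
    """Index each record under the tokens in its prefix.

    Tokens are ordered by ascending document frequency (rarest first),
    with the token itself as a tie-breaker so the ordering is total.
    """
    df = Counter(tok for rec in records for tok in set(rec))

    def order(tok):
        return (df.get(tok, 0), tok)

    index = defaultdict(set)
    for rid, rec in enumerate(records):
        toks = sorted(set(rec), key=order)
        for tok in toks[:prefix_length(len(toks), threshold)]:
            index[tok].add(rid)
    return index, order


def search(query, records, index, order, threshold):
    """Return (candidate ids, verified matching ids) for a query."""
    toks = sorted(set(query), key=order)
    candidates = set()
    # Candidate generation: any record sharing a prefix token with the query.
    for tok in toks[:prefix_length(len(toks), threshold)]:
        candidates |= index.get(tok, set())
    # Verification: compute the exact similarity only for candidates.
    matches = sorted(
        rid for rid in candidates if jaccard(query, records[rid]) >= threshold
    )
    return candidates, matches
```

Because rare tokens come first in the ordering, the prefixes tend to contain infrequent tokens, so few records share a prefix token with the query and the candidate set stays small. The abstract's proposal, as stated, is to consider multiple token orderings rather than this single one.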
Pages: 822-833
Number of pages: 12