Leveraging deletion neighborhoods and trie for efficient string similarity search and join

Cited by: 1
Authors
Cui, Jia [1, 3]
Meng, Dan [2]
Chen, Zhong-Tao [1, 3]
Affiliations
[1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
[2] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
[3] University of Chinese Academy of Sciences, Beijing, China
Keywords
Deletion neighborhoods - Edit distance - Similarity join - Similarity search - Trie
DOI
10.1007/978-3-319-12844-3_1
Abstract
String similarity search and join are primitive operations in databases and information retrieval for addressing poor data quality. Because of the high complexity of deletion neighborhoods, existing methods resort to hashing schemes to reduce the space requirement of the index. However, the hash collisions this introduces must be verified by costly edit distance computation. In this paper, we focus on achieving faster query speed with affordable memory consumption. We propose a novel method that leverages deletion neighborhoods and a trie to answer edit-distance-based string similarity queries efficiently. We use the trie to share common prefixes of deletion neighborhoods and propose a subtree merging optimization to reduce the index size. We then discuss index partition strategies and propose a bit-vector-based verification method to speed up queries. The experimental results show that our method outperforms state-of-the-art methods on real data. © Springer International Publishing Switzerland 2014.
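The core idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it only relies on the known property that two strings within edit distance k share at least one element of their k-deletion neighborhoods, stores those neighborhoods in a plain trie so that common prefixes are shared, and verifies candidates with an ordinary dynamic-programming edit distance. Subtree merging, index partitioning, and bit-vector verification are omitted, and the names `deletion_neighborhood`, `edit_distance`, `TrieNode`, and `DeletionTrie` are hypothetical.

```python
from itertools import combinations


def deletion_neighborhood(s, k):
    """Return every string obtained from s by deleting at most k characters."""
    variants = set()
    for d in range(min(k, len(s)) + 1):
        for positions in combinations(range(len(s)), d):
            drop = set(positions)
            variants.add("".join(c for i, c in enumerate(s) if i not in drop))
    return variants


def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance (not the bit-vector method)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class TrieNode:
    __slots__ = ("children", "ids")

    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.ids = set()     # ids of strings whose neighborhood ends at this node


class DeletionTrie:
    """Illustrative index: a trie over the k-deletion neighborhoods of all strings."""

    def __init__(self, strings, k):
        self.strings = list(strings)
        self.k = k
        self.root = TrieNode()
        for sid, s in enumerate(self.strings):
            for variant in deletion_neighborhood(s, k):
                node = self.root
                for ch in variant:
                    node = node.children.setdefault(ch, TrieNode())
                node.ids.add(sid)

    def search(self, query):
        """Return indexed strings within edit distance k of query, sorted."""
        candidates = set()
        for variant in deletion_neighborhood(query, self.k):
            node = self.root
            for ch in variant:
                node = node.children.get(ch)
                if node is None:
                    break
            else:
                candidates |= node.ids
        # A shared variant only bounds the distance by 2k, so verify each candidate.
        return sorted(
            self.strings[i]
            for i in candidates
            if edit_distance(query, self.strings[i]) <= self.k
        )


index = DeletionTrie(["trie", "tree", "tries", "join"], k=1)
matches = index.search("tree")
```

Here `"tree"` and `"trie"` both contain `"tre"` in their 1-deletion neighborhoods, so `"trie"` becomes a candidate and passes verification. The verification step is needed because sharing a variant does not by itself guarantee a distance of at most k.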
Related papers
50 in total
  • [41] Efficient subgraph join based on connectivity similarity
    Yue Wang
    Hongzhi Wang
    Jianzhong Li
    Hong Gao
    World Wide Web, 2015, 18 : 871 - 887
  • [42] I/O-Efficient Similarity Join
    Pagh, Rasmus
    Pham, Ninh
    Silvestri, Francesco
    Stockel, Morten
    ALGORITHMS - ESA 2015, 2015, 9294 : 941 - 952
  • [43] I/O-Efficient Similarity Join
    Pagh, Rasmus
    Pham, Ninh
    Silvestri, Francesco
    Stockel, Morten
    ALGORITHMICA, 2017, 78 (04) : 1263 - 1283
  • [44] I/O-Efficient Similarity Join
    Rasmus Pagh
    Ninh Pham
    Francesco Silvestri
    Morten Stöckel
    Algorithmica, 2017, 78 : 1263 - 1283
  • [45] Extending the Bag Distance for String Similarity Search
    Mergen S.
    SN Computer Science, 4 (2)
  • [46] b-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches
    Kanda, Shunsuke
    Tabei, Yasuo
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 810 - 819
  • [47] Efficient algorithms for similarity search
    Rajasekaran, S
    Hu, Y
    Luo, J
    Nick, H
    Pardalos, PM
    Sahni, S
    Shaw, G
    JOURNAL OF COMBINATORIAL OPTIMIZATION, 2001, 5 (01) : 125 - 132
  • [48] Efficient Algorithms for Similarity Search
    S. Rajasekaran
    Y. Hu
    J. Luo
    H. Nick
    P.M. Pardalos
    S. Sahni
    G. Shaw
    Journal of Combinatorial Optimization, 2001, 5 : 125 - 132
  • [49] An efficient MapReduce algorithm for similarity join in metric spaces
    Liu, Wen
    Shen, Yanming
    Wang, Peng
    JOURNAL OF SUPERCOMPUTING, 2016, 72 (03) : 1179 - 1200
  • [50] Efficient graph similarity join for information integration on graphs
    Yue WANG
    Hongzhi WANG
    Jianzhong LI
    Hong GAO
    Frontiers of Computer Science, 2016, 10 (02) : 317 - 329