Efficient Similarity Search in Very Large String Sets

Authors
Fenz, Dandy [1]
Lange, Dustin [1]
Rheinlaender, Astrid [2]
Naumann, Felix [1]
Leser, Ulf [2]
Affiliations
[1] Hasso Plattner Inst, Potsdam, Germany
[2] Humboldt Univ, Dept Comp Sci, Berlin, Germany
Keywords
ALGORITHM; JOIN;
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem. SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy that makes the index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against search time. We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.
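To make the search problem in the abstract concrete, the following is a minimal sketch of edit-distance search over a plain trie. It is not the State Set Index: it omits SSI's state labeling and automaton interpretation and instead uses the well-known baseline of carrying one dynamic-programming row per trie node. All names (`TrieNode`, `build_trie`, `search`) and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of a prefix tree; `word` is set where a stored string ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string into a trie and return the root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (string, distance) for all stored strings within edit distance k."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance from "" to each prefix of the query.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))

    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the Levenshtein table by one row for the character `ch`.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: if every entry exceeds k, no extension of this prefix can match.
        if min(row) <= k:
            stack.extend((c, cch, row) for cch, c in node.children.items())
    return sorted(results)


# Hypothetical sample data in the spirit of the person-name sets mentioned above.
names = ["meier", "meyer", "maier", "mayer", "miller", "smith"]
index = build_trie(names)
print(search(index, "meier", 1))
```

Shared prefixes are evaluated only once, and a subtree is skipped as soon as its row shows that the threshold `k` can no longer be met. The sketch keeps a full row of integers per visited node and makes no attempt at the space and time trade-off the abstract attributes to SSI.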
Pages: 262-279
Page count: 18