String similarity search and join: a survey

Cited by: 74
Authors
Yu, Minghe [1]
Li, Guoliang [1]
Deng, Dong [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
string similarity; similarity search; similarity join; top-k; TRIE-BASED METHOD; FILTERING ALGORITHMS; EFFICIENT ALGORITHM; FRAMEWORK
DOI
10.1007/s11704-015-5900-5
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.
Pages: 399-417
Page count: 19
Related papers
50 in total
  • [1] String similarity search and join: a survey
    Minghe Yu
    Guoliang Li
    Dong Deng
    Jianhua Feng
    [J]. Frontiers of Computer Science, 2016, 10 : 399 - 417
  • [2] String similarity search and join: a survey
    Minghe YU
    Guoliang LI
    Dong DENG
    Jianhua FENG
    [J]. Frontiers of Computer Science, 2016, 10 (03) : 399 - 417
  • [3] State-of-the-art in String Similarity Search and Join
    Wandelt, Sebastian
    Deng, Dong
    Gerdjikov, Stefan
    Mishra, Shashwat
    Mitankin, Petar
    Patil, Manish
    Siragusa, Enrico
    Tiskin, Alexander
    Wang, Wei
    Wang, Jiaying
    Leser, Ulf
    [J]. SIGMOD RECORD, 2014, 43 (01) : 64 - 76
  • [4] Highly Efficient String Similarity Search and Join over Compressed Indexes
    Xiao, Guorui
    Wang, Jin
    Lin, Chunbin
    Zaniolo, Carlo
    [J]. 2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 232 - 244
  • [6] Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    [J]. INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2014, 2014, 8870 : 1 - 13
  • [7] Leveraging deletion neighborhoods and trie for efficient string similarity search and join
    Cui, Jia
    Meng, Dan
    Chen, Zhong-Tao
    [J]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, 8870
  • [8] Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
    Lu, Jiaheng
    Lin, Chunbin
    Wang, Jin
    Li, Chen
    [J]. PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019, : 2975 - 2976
  • [9] Incremental processing for string similarity join
    Yan, Cairong
    Zhu, Bin
    Gan, Yanglan
    Xu, Guangwei
    [J]. INTERNATIONAL JOURNAL OF COMPUTATIONAL SCIENCE AND ENGINEERING, 2019, 20 (02) : 255 - 268
  • [10] Parallelizing String Similarity Join Algorithms
    Yao, Ling-Chih
    Lim, Lipyeow
    [J]. DATABASES THEORY AND APPLICATIONS, ADC 2018, 2018, 10837 : 322 - 327