String similarity join with different similarity thresholds based on novel indexing techniques

Cited by: 3

Authors:

Rong, Chuitian [1]

Silva, Yasin N. [2]

Li, Chunqing [1]

Affiliations:

[1] Tianjin Polytech Univ, Sch Comp Sci & Software Engn, Tianjin 300387, Peoples R China

[2] Arizona State Univ, Sch Math & Nat Sci, Tempe, AZ 85281 USA

Source:

FRONTIERS OF COMPUTER SCIENCE | 2017 / Volume 11 / Issue 02

Funding:

National Natural Science Foundation of China;

Keywords:

similarity join; similarity aware index; similarity thresholds; EFFICIENT;

DOI:

10.1007/s11704-016-5231-1

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity using a chosen similarity function. The string pairs whose similarity is above a given threshold are returned as results. Current approaches to the similarity join problem use a single threshold value. There are, however, several scenarios that require support for multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate many more typos than shorter ones. Therefore, we propose a solution for string similarity joins that supports different similarity thresholds in a single operator. To support different thresholds, we devise two novel indexing techniques: partition-based indexing and similarity-aware indexing. To utilize the new indices and improve join performance, we propose new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
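
To make the problem statement concrete, the following is a minimal Python sketch of a similarity join in which the threshold depends on string length. It is a naive nested-loop baseline and not the partition-based or similarity-aware indexing described in the abstract. The choice of normalized edit similarity as the similarity function, the function names, and the threshold values in `threshold_for` are all assumptions made for illustration; the abstract does not specify them.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalized edit similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def threshold_for(length: int) -> float:
    """Hypothetical rule: longer strings get a looser (lower) threshold."""
    return 0.9 if length < 10 else 0.8


def similarity_join(left, right):
    """Return all (r, s) pairs whose similarity meets the pair's threshold.

    Assumption: the pair's threshold is chosen from the longer string's length.
    """
    results = []
    for r in left:
        for s in right:
            tau = threshold_for(max(len(r), len(s)))
            if edit_similarity(r, s) >= tau:
                results.append((r, s))
    return results
```

Under these assumed thresholds, joining `["cat", "international business machines"]` with `["cut", "internationl busines machnes"]` keeps the long pair (three typos over 31 characters) and rejects the short pair (one typo over 3 characters), which is the behavior a single global threshold cannot express well.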

Citation

Pages: 307 - 319

Number of pages: 13
