How improve Set Similarity Join based on prefix approach in distributed environment

Cited by: 0

Authors:

Zhu, Song [1]

Gagliardelli, Luca [1]

Simonini, Giovanni [1]

Beneventano, Domenico [1]

Affiliations:

[1] Univ Modena & Reggio Emilia, Dept Engn Enzo Ferrari, Modena, Italy

Source:

PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS) | 2018

Keywords:

Similarity Join; Big Data; Record Linkage; MAPREDUCE;

DOI:

10.1109/HPCS.2018.00136

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Set similarity join is an essential operation for finding similar pairs of records in data integration and data analytics applications. To cope with the increasing scale of data, several techniques have been proposed to perform set similarity join on distributed frameworks (e.g., MapReduce). In particular, a MapReduce implementation of PPJoin, which was experimentally shown to be one of the best set similarity join algorithms, is publicly available. However, these techniques produce huge numbers of duplicates in order to process the data in parallel. Moreover, these approaches do not guarantee load balancing, which causes skew and limits their scalability. To address these problems, we propose a duplicate-free technique called TTJoin, which performs set similarity join efficiently by using an innovative filter derived from the prefix filter. We implemented TTJoin on Apache Spark, one of the most innovative distributed frameworks. Several experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to either traditional TTJoin MapReduce implementation.
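For context, the classical prefix filter that the abstract builds on can be sketched as below. This is a minimal single-machine illustration using Jaccard similarity; it is not the TTJoin filter and not the authors' Spark or MapReduce code. The function names, the frequency-based token ordering, and the small epsilon used to guard against floating-point rounding are all assumptions made for this example.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix_length(size, threshold):
    """Prefix length for a set of `size` tokens under a Jaccard threshold.

    Two sets with Jaccard >= threshold must share at least one token
    within their first size - ceil(threshold * size) + 1 ordered tokens.
    The epsilon is an assumption to avoid ceil() rounding up on float noise.
    """
    return size - math.ceil(threshold * size - 1e-9) + 1


def prefix_filter_join(records, threshold):
    """Return sorted (id_a, id_b) pairs whose Jaccard similarity >= threshold.

    `records` maps a sortable record id to an iterable of tokens.
    Empty records never produce candidates in this sketch.
    """
    # Global token order: rarest tokens first, ties broken by token value.
    freq = defaultdict(int)
    for tokens in records.values():
        for tok in set(tokens):
            freq[tok] += 1

    index = defaultdict(list)  # prefix token -> ids of records seen so far
    candidates = set()         # a set, so each pair is kept only once
    for rid in sorted(records):
        ordered = sorted(set(records[rid]), key=lambda t: (freq[t], t))
        for tok in ordered[:prefix_length(len(ordered), threshold)]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Verification step: compute the exact similarity of each candidate pair.
    return sorted(
        (a, b) for a, b in candidates
        if jaccard(records[a], records[b]) >= threshold
    )
```

In this sketch a pair of records can meet under several shared prefix tokens, and the `candidates` set removes the repeats locally. In a distributed setting each shared token would typically send the pair to a different worker, which is one way the duplicates mentioned in the abstract can arise.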


Pages: 844-851

Page count: 8

50 items in total

[1] Distributed Streaming Set Similarity Join
Yang, Jianye
Zhang, Wenjie
Wang, Xiang
Zhang, Ying
Lin, Xuemin
2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020), 2020, : 565 - 576
[2] Generalizing prefix filtering to improve set similarity joins
Ribeiro, Leonardo Andrade
Haerder, Theo
INFORMATION SYSTEMS, 2011, 36 (01) : 62 - 78
[3] Dynamic Set Similarity Join: An Update Log Based Approach
Yang, Chengcheng
Chen, Lisi
Wang, Hao
Shang, Shuo
Mao, Rui
Zhang, Xiangliang
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (04) : 3727 - 3741
[4] Semi-Stream Similarity Join Processing in a Distributed Environment
Kim, Hong-Ji
Lee, Ki-Hoon
IEEE ACCESS, 2020, 8 : 130194 - 130204
[5] A Prefix-Filter based Method for Spatio-Textual Similarity Join
Liu, Sitong
Li, Guoliang
Feng, Jianhua
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (10) : 2354 - 2367
[6] Power-Law Based Estimation of Set Similarity Join Size
Lee, Hongrae
Ng, Raymond T.
Shim, Kyuseok
PROCEEDINGS OF THE VLDB ENDOWMENT, 2009, 2 (01) : 658 - 669
[7] Subseries Join: A Similarity-Based Time Series Match Approach
Lin, Yi
McCool, Michael D.
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT I, PROCEEDINGS, 2010, 6118 : 238 - +
[8] Parallel set similarity join on big data based on Locality-Sensitive Hashing
Sohrabi, Mohammad Karim
Azgomi, Hosseion
SCIENCE OF COMPUTER PROGRAMMING, 2017, 145 : 1 - 12
[9] LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew
Rashtchian, Cyrus
Sharma, Aneesh
Woodruff, David P.
WEB CONFERENCE 2020: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2020), 2020, : 2998 - 3004
[10] Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering
Nie, Tiezheng
Lee, Wang-chien
Shen, Derong
Yu, Ge
Kou, Yue
WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485 : 138 - 149
