How improve Set Similarity Join based on prefix approach in distributed environment

Cited by: 0
Authors
Zhu, Song [1]
Gagliardelli, Luca [1]
Simonini, Giovanni [1]
Beneventano, Domenico [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Dept Engn Enzo Ferrari, Modena, Italy
Keywords
Similarity Join; Big Data; Record Linkage; MapReduce
DOI
10.1109/HPCS.2018.00136
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Set similarity join is an essential operation for finding similar pairs of records in data integration and data analytics applications. To cope with the increasing scale of data, several techniques have been proposed to perform set similarity join using distributed frameworks (e.g., MapReduce). In particular, a MapReduce implementation of PPJoin, which was experimentally demonstrated to be one of the best set similarity join algorithms, is publicly available. However, these techniques produce huge amounts of duplicates in order to achieve parallel processing. Moreover, these approaches do not guarantee load balancing, which leads to skew and limits their scalability. To address these problems, we propose a duplicate-free technique called TTJoin, which performs set similarity join efficiently by using an innovative filter derived from the prefix filter. Moreover, we implemented TTJoin on Apache Spark, one of the most innovative distributed frameworks. Several experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to the traditional TTJoin MapReduce implementation.
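For readers unfamiliar with the prefix filter that the abstract builds on, the following is a minimal single-machine sketch of the standard prefix-filter idea for a Jaccard threshold. It is not TTJoin and not the PPJoin MapReduce implementation, neither of which is described in this record; the function names `jaccard` and `prefix_filter_join` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets (two empty sets count as identical)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def prefix_filter_join(records, threshold):
    """Return sorted (i, j, sim) triples, i < j, with Jaccard >= threshold.

    Illustrative sketch of the classic prefix filter. Records sharing no
    prefix token are never compared, so empty records are never paired.
    """
    # Global token order: rarest tokens first, ties broken by the token itself.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records whose prefix holds it
    results = []
    for rid, toks in enumerate(ordered):
        # Prefix length for Jaccard: |x| - ceil(t * |x|) + 1.
        # The small epsilon guards against floating-point round-up.
        prefix_len = len(toks) - ceil(threshold * len(toks) - 1e-9) + 1
        prefix = toks[:prefix_len]

        # Candidate generation: earlier records sharing a prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification: compute the exact similarity for each candidate.
        for cid in candidates:
            sim = jaccard(set(toks), set(ordered[cid]))
            if sim >= threshold:
                results.append((cid, rid, sim))

        # Index the current record's prefix for later records.
        for tok in prefix:
            index[tok].append(rid)
    return sorted(results)
```

Because each candidate id is collected into a set before verification, this sketch emits every pair at most once on a single machine. The duplicates discussed in the abstract arise when records are replicated across partitions by prefix token in a distributed setting, which this sketch does not model.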
Pages: 844 - 851
Number of pages: 8
Related papers
50 in total
  • [1] Distributed Streaming Set Similarity Join
    Yang, Jianye
    Zhang, Wenjie
    Wang, Xiang
    Zhang, Ying
    Lin, Xuemin
    2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020), 2020, : 565 - 576
  • [2] Generalizing prefix filtering to improve set similarity joins
    Ribeiro, Leonardo Andrade
    Haerder, Theo
    INFORMATION SYSTEMS, 2011, 36 (01) : 62 - 78
  • [3] Dynamic Set Similarity Join: An Update Log Based Approach
    Yang, Chengcheng
    Chen, Lisi
    Wang, Hao
    Shang, Shuo
    Mao, Rui
    Zhang, Xiangliang
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (04) : 3727 - 3741
  • [4] Semi-Stream Similarity Join Processing in a Distributed Environment
    Kim, Hong-Ji
    Lee, Ki-Hoon
    IEEE ACCESS, 2020, 8 : 130194 - 130204
  • [5] A Prefix-Filter based Method for Spatio-Textual Similarity Join
    Liu, Sitong
    Li, Guoliang
    Feng, Jianhua
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (10) : 2354 - 2367
  • [6] Power-Law Based Estimation of Set Similarity Join Size
    Lee, Hongrae
    Ng, Raymond T.
    Shim, Kyuseok
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2009, 2 (01) : 658 - 669
  • [7] Subseries Join: A Similarity-Based Time Series Match Approach
    Lin, Yi
    McCool, Michael D.
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT I, PROCEEDINGS, 2010, 6118 : 238 - +
  • [8] Parallel set similarity join on big data based on Locality-Sensitive Hashing
    Sohrabi, Mohammad Karim
    Azgomi, Hossein
    SCIENCE OF COMPUTER PROGRAMMING, 2017, 145 : 1 - 12
  • [9] LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew
    Rashtchian, Cyrus
    Sharma, Aneesh
    Woodruff, David P.
    WEB CONFERENCE 2020: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2020), 2020, : 2998 - 3004
  • [10] Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering
    Nie, Tiezheng
    Lee, Wang-chien
    Shen, Derong
    Yu, Ge
    Kou, Yue
    WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485 : 138 - 149