A Scalable Similarity Join Algorithm Based on MapReduce and LSH

Cited by: 3
Authors
Rivault, Sebastien [1 ]
Bamha, Mostafa [1 ]
Limet, Sebastien [1 ]
Robert, Sophie [1 ]
Affiliations
[1] Univ Orleans, INSA Ctr Val Loire, EA, LIFO, F-4022 Orleans, France
Keywords
Similarity join operations; Local sensitive hashing (LSH); MapReduce model; Data skew; Hadoop framework;
DOI
10.1007/s10766-022-00733-6
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Similarity joins are recognized to be among the most useful data processing and analysis operations. A similarity join is used to retrieve all data pairs whose distances are smaller than a predefined threshold λ. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized local sensitive hashing keys redistribution approach are used to balance load among processing nodes while reducing communications and computations to almost all relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large scale systems, during all stages of similarity join computations. These performances have been confirmed by a series of experiments using the Fréchet distance on large datasets of trajectories from real world and synthetic data benchmarks.
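To make the abstract's definition concrete, the following is a minimal single-machine sketch of a similarity join over trajectories. It is not the MRS-join algorithm: it uses neither MapReduce, LSH, nor distributed histograms, and simply compares every pair. The function names `discrete_frechet` and `similarity_join`, the parameter `lam` (the threshold), and the choice of the discrete variant of the Fréchet distance are assumptions made for illustration; the abstract does not say which variant the authors use.

```python
from itertools import product
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two non-empty point sequences.

    Assumption: the discrete variant is used here for simplicity.
    Standard dynamic program over the coupling table.
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])  # Euclidean distance between two points
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


def similarity_join(r, s, lam):
    """Return index pairs (i, j) whose trajectories are closer than lam.

    Brute force: every trajectory of r is compared with every one of s.
    """
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(r), enumerate(s))
        if discrete_frechet(a, b) < lam
    ]


if __name__ == "__main__":
    r = [[(0, 0), (1, 0), (2, 0)]]
    s = [[(0, 1), (1, 1), (2, 1)], [(0, 5), (1, 5), (2, 5)]]
    print(similarity_join(r, s, lam=1.5))
```

The all-pairs comparison costs one distance computation per pair, which is the cost that hashing-based candidate filtering and load balancing, as described in the abstract, aim to avoid at scale.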
Pages: 360 - 380
Page count: 21
Related papers
50 in total
  • [31] A Study on Subsequence Similarity Join in Time Series Data Using MapReduce
    Park, Kyounghyun
    Won, Hee Sun
    Ryu, Keun Ho
    ADVANCES IN COMPUTER SCIENCE AND UBIQUITOUS COMPUTING, 2018, 474 : 851 - 859
  • [32] Parallel Top-K Similarity Join Algorithms Using MapReduce
    Kim, Younghoon
    Shim, Kyuseok
    2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 510 - 521
  • [33] Scalable Collaborative Filtering Recommendation Algorithm with MapReduce
    Shang, Yang
    Li, Zhiyang
    Qu, Wenyu
    Xu, Yujie
    Song, Zining
    Zhou, Xuefei
    2014 IEEE 12TH INTERNATIONAL CONFERENCE ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING (DASC)/2014 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED COMPUTING (EMBEDDEDCOM)/2014 IEEE 12TH INTERNATIONAL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING (PICOM), 2014, : 103 - 108
  • [34] RPK-table based efficient algorithm for join-aggregate query on MapReduce
    Li, Zhan
    Feng, Qi
    Chen, Wei
    Wang, Tengjiao
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2016, 1 (01) : 79 - 89
  • [35] Algorithm for processing k-nearest join based on R-tree in MapReduce
    Liu, Yi
    Jing, Ning
    Chen, Luo
    Xiong, Wei
    Ruan Jian Xue Bao/Journal of Software, 2013, 24 (08) : 1836 - 1851
  • [36] Projection Based Large Scale High-Dimensional Data Similarity Join Using MapReduce Framework
    Ma, Youzhong
    Zhang, Ruiling
    Cui, Zhanyou
    Lin, Chunjie
    IEEE ACCESS, 2020, 8 : 121665 - 121677
  • [37] Performance Evaluation for Distributed Join Based on MapReduce
    Zhang, Jingwei
    Yang, Qing
    Shang, Hongjia
    Zhang, Huibing
    Lin, Yuming
    Zhou, Rui
    2016 7TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA (CCBD), 2016, : 295 - 301
  • [38] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Ji-zhou Luo
    Sheng-fei Shi
    Hong-zhi Wang
    Jian-zhong Li
    Frontiers of Information Technology & Electronic Engineering, 2017, 18 : 1499 - 1510
  • [39] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Ji-zhou LUO
    Sheng-fei SHI
    Hong-zhi WANG
    Jian-zhong LI
    Frontiers of Information Technology & Electronic Engineering, 2017, 18 (10) : 1499 - 1510
  • [40] FrepJoin: an efficient partition-based algorithm for edit similarity join
    Luo, Ji-zhou
    Shi, Sheng-fei
    Wang, Hong-zhi
    Li, Jian-zhong
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2017, 18 (10) : 1499 - 1510