A Scalable Similarity Join Algorithm Based on MapReduce and LSH

Cited by: 3
Authors
Rivault, Sebastien [1]
Bamha, Mostafa [1]
Limet, Sebastien [1]
Robert, Sophie [1]
Affiliations
[1] Univ Orleans, INSA Ctr Val Loire, EA, LIFO, F-4022 Orleans, France
Keywords
Similarity join operations; Local sensitive hashing (LSH); MapReduce model; Data skew; Hadoop framework;
DOI
10.1007/s10766-022-00733-6
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202 ;
Abstract
Similarity joins are recognized to be among the most useful data processing and analysis operations. A similarity join is used to retrieve all data pairs whose distances are smaller than a predefined threshold. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized local sensitive hashing keys redistribution approach are used to balance load among processing nodes while reducing communications and computations to almost all relevant data by using distributed histograms. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large scale systems, during all stages of similarity join computations. These performances have been confirmed by a series of experiments using the Fréchet distance on large datasets of trajectories from real world and synthetic data benchmarks.
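To make the operation in the abstract concrete, here is a minimal single-machine sketch of a similarity join over trajectories. It is not the MRS-join algorithm: it uses no MapReduce, no LSH and no histograms, and simply compares every pair. It also assumes the discrete Fréchet distance, a common variant of the Fréchet distance; the abstract does not say which variant the paper uses. The names `discrete_frechet`, `similarity_join` and `threshold` are illustrative and not taken from the paper.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def discrete_frechet(p, q):
    """Discrete Frechet distance between two non-empty point sequences."""
    n, m = len(p), len(q)
    # ca[i][j] holds the coupling distance of the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[n - 1][m - 1]


def similarity_join(r, s, threshold):
    """Brute-force join: index pairs (i, j) whose distance is below threshold."""
    return [
        (i, j)
        for i, a in enumerate(r)
        for j, b in enumerate(s)
        if discrete_frechet(a, b) < threshold
    ]


# Hypothetical toy data: one trajectory in r, two in s.
r = [[(0, 0), (1, 0), (2, 0)]]
s = [[(0, 1), (1, 1), (2, 1)], [(0, 5), (1, 5), (2, 5)]]
print(similarity_join(r, s, threshold=1.5))
```

This baseline evaluates all |R| x |S| pairs, which is the cost that the hashing and redistribution steps described in the abstract aim to avoid on large datasets.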
Pages: 360-380
Page count: 21
Related papers
50 in total
  • [21] Fast and scalable vector similarity joins with MapReduce
    Byoungju Yang
    Hyun Joon Kim
    Junho Shim
    Dongjoo Lee
    Sang-goo Lee
    Journal of Intelligent Information Systems, 2016, 46 : 473 - 497
  • [22] Scalable SimRank Join Algorithm
    Maehara, Takanori
    Kusumoto, Mitsuru
    Kawarabayashi, Ken-ichi
    2015 IEEE 31ST INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2015, : 603 - 614
  • [23] Efficient and Scalable Processing of String Similarity Join
    Rong, Chuitian
    Lu, Wei
    Wang, Xiaoli
    Du, Xiaoyong
    Chen, Yueguo
    Tung, Anthony K. H.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (10) : 2217 - 2230
  • [24] Set similarity join on massive probabilistic data using MapReduce
    Youzhong Ma
    Xiaofeng Meng
    Distributed and Parallel Databases, 2014, 32 : 447 - 464
  • [25] Efficient Spatio-textual Similarity Join Using MapReduce
    Zhang, Yu
    Ma, Youzhong
    Meng, Xiaofeng
    2014 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 1, 2014, : 52 - 59
  • [26] Set similarity join on massive probabilistic data using MapReduce
    Ma, Youzhong
    Meng, Xiaofeng
    DISTRIBUTED AND PARALLEL DATABASES, 2014, 32 (03) : 447 - 464
  • [27] DIGDUG: Scalable Separable Dense Graph Pruning and Join Operations in MapReduce
    Shukla, Manu
    Dharme, Dinesh
    Ramnarain, Pallavi
    Santos, Ray Dos
    Lu, Chang-Tien
    IEEE TRANSACTIONS ON BIG DATA, 2021, 7 (06) : 930 - 951
  • [28] Fuzzy Similarity Join Algorithm Based on Dynamic Double Prefixes
    Yu C.-Y.
    Wang W.-H.
    Wen X.-J.
    Zhao Y.-H.
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2022, 43 (03): : 321 - 327
  • [29] A novel KNN join algorithm with Hilbert curve in MapReduce
    Du, Qinsheng
    Li, Xiongfei
    Te, Regen
    ICIC Express Letters, 2014, 8 (09): : 2537 - 2544
  • [30] An Efficient Similarity Join Algorithm with Cosine Similarity Predicate
    Lee, Dongjoo
    Park, Jaehui
    Shim, Junho
    Lee, Sang-goo
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT 2, 2010, 6262 : 422 - +