A Scalable Similarity Join Algorithm Based on MapReduce and LSH

Cited by: 3
Authors
Rivault, Sebastien [1]
Bamha, Mostafa [1]
Limet, Sebastien [1]
Robert, Sophie [1]
Affiliations
[1] Univ Orleans, INSA Ctr Val Loire, EA, LIFO, F-4022 Orleans, France
Keywords
Similarity join operations; Local sensitive hashing (LSH); MapReduce model; Data skew; Hadoop framework;
DOI
10.1007/s10766-022-00733-6
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202 ;
Abstract
Similarity joins are recognized as being among the most useful data processing and analysis operations. A similarity join retrieves all data pairs whose distance is smaller than a predefined threshold. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized redistribution of local sensitive hashing (LSH) keys are used to balance load among processing nodes, while distributed histograms restrict communication and computation almost entirely to relevant data. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large-scale systems, during all stages of similarity join computation. These properties have been confirmed by a series of experiments using the Fréchet distance on large trajectory datasets from real-world and synthetic benchmarks.
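To make the definition of a similarity join in the abstract concrete, here is a minimal brute-force sketch. It is not the MRS-join algorithm: it uses no MapReduce, LSH, or histograms, and it compares every pair. It assumes trajectories are lists of 2D points and uses the discrete Fréchet distance as a stand-in for the Fréchet distance named in the abstract; the function names `discrete_frechet` and `naive_similarity_join` are hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def discrete_frechet(p, q):
    """Discrete Frechet distance between two point sequences (assumed variant)."""
    n, m = len(p), len(q)
    # ca[i][j] = coupling distance between prefixes p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


def naive_similarity_join(r, s, threshold):
    """Return index pairs (i, j) whose distance is strictly below threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(r), enumerate(s))
        if discrete_frechet(a, b) < threshold
    ]


if __name__ == "__main__":
    r = [[(0, 0), (1, 0), (2, 0)]]
    s = [[(0, 1), (1, 1), (2, 1)], [(0, 5), (1, 5), (2, 5)]]
    print(naive_similarity_join(r, s, threshold=2.0))
```

This baseline evaluates the distance for every pair drawn from the two inputs, which is the quadratic cost that a distributed, hashing-based approach such as the one described in the abstract aims to avoid.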
Pages: 360 - 380
Page count: 21
Related papers
50 in total
  • [41] Scalable Hybrid Similarity Join over Geolocated Time Series
    Chatzigeorgakidis, Georgios
    Patroumpas, Kostas
    Skoutas, Dimitrios
    Athanasiou, Spiros
    Skiadopoulos, Spiros
    26TH ACM SIGSPATIAL INTERNATIONAL CONFERENCE ON ADVANCES IN GEOGRAPHIC INFORMATION SYSTEMS (ACM SIGSPATIAL GIS 2018), 2018, : 119 - 128
  • [42] MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm
    Li, Yufeng
    Jiang, HaiTian
    Lu, Jiyong
    Li, Xiaozhong
    Sun, Zhiwei
    Li, Min
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2021, 40 (03) : 5295 - 5305
  • [43] EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM
    Gouda, Karam
    Rashad, Metwally
    COMPUTING AND INFORMATICS, 2017, 36 (03) : 683 - 704
  • [44] A linear DBSCAN algorithm based on LSH
    Wu, Yi-Pu
    Guo, Jin-Jiang
    Zhang, Xue-Jie
    PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2007, : 2608 - 2614
  • [45] Scalable Quick Reduct Algorithm - Iterative MapReduce Approach
    Singh, Praveen Kumar
    Prasad, P. S. V. S. Sai
    PROCEEDINGS OF THE THIRD ACM IKDD CONFERENCE ON DATA SCIENCES (CODS), 2016,
  • [46] SALA: A Skew-Avoiding and Locality-Aware Algorithm for MapReduce-Based Join
    Lin, Ziyu
    Cai, Minxing
    Huang, Ziming
    Lai, Yongxuan
    WEB-AGE INFORMATION MANAGEMENT (WAIM 2015), 2015, 9098 : 311 - 323
  • [47] Research on Load Balancing MapReduce Equivalent Join Based on Intelligent Sampling and Multi Knapsack Algorithm
    Yang, Cai
    Yang, Jizheng
    Jia, Songhao
    Chen, Xing
    Liu, Yan
    RECENT ADVANCES IN ELECTRICAL & ELECTRONIC ENGINEERING, 2022, 15 (04) : 335 - 346
  • [48] C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join
    Li, Hangyu
    Nutanong, Sarana
    Xu, Hong
    Yu, Chenyun
    Ha, Foryu
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2019, 31 (03) : 423 - 436
  • [49] C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join
    Li, Hangyu
    Nutanong, Sarana
    Xu, Hong
    Yu, Chenyun
    Ha, Foryu
    2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 2121 - 2122
  • [50] Join processing with threshold-based filtering in MapReduce
    Lee, Taewhi
    Bae, Hye-Chan
    Kim, Hyoung-Joo
    JOURNAL OF SUPERCOMPUTING, 2014, 69 (02) : 793 - 813