A Scalable Similarity Join Algorithm Based on MapReduce and LSH

Cited by: 3

Authors:

Rivault, Sebastien [1]

Bamha, Mostafa [1]

Limet, Sebastien [1]

Robert, Sophie [1]

Affiliations:

[1] Univ Orleans, INSA Ctr Val Loire, EA, LIFO, F-4022 Orleans, France

Source:

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING | 2022 / Vol. 50 / Issue 3-4

Keywords:

Similarity join operations; Local sensitive hashing (LSH); MapReduce model; Data skew; Hadoop framework

DOI:

10.1007/s10766-022-00733-6

Chinese Library Classification:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

Similarity joins are recognized as among the most useful data processing and analysis operations. A similarity join retrieves all data pairs whose distance is smaller than a predefined threshold. In this paper, we introduce the MRS-join algorithm to perform similarity joins on large trajectory datasets. The MapReduce model and a randomized redistribution of locality-sensitive hashing (LSH) keys are used to balance load among processing nodes, while distributed histograms restrict communication and computation almost entirely to relevant data. A cost analysis of the MRS-join algorithm shows that our approach is insensitive to data skew and guarantees perfect balancing properties, in large-scale systems, during all stages of similarity join computation. This performance has been confirmed by a series of experiments using the Fréchet distance on large trajectory datasets from real-world and synthetic data benchmarks.
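To make the join predicate in the abstract concrete, the following is a minimal single-machine sketch of a similarity join over trajectories. It is not the MRS-join algorithm: it compares every pair by brute force, with no LSH, histograms, or MapReduce. The discrete Fréchet distance is used here as an assumption, since the abstract names only "the Fréchet distance", and the function names and the `threshold` parameter are illustrative.

```python
from itertools import product
from math import dist


def discrete_frechet(p, q):
    """Discrete Frechet distance between two non-empty point sequences.

    Assumption: the discrete variant is used for illustration; the record
    does not say which Frechet variant the paper evaluates.
    """
    n, m = len(p), len(q)
    # ca[i][j] holds the coupling distance for prefixes p[:i+1], q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


def similarity_join(r, s, threshold):
    """Return index pairs (i, j) with distance(r[i], s[j]) < threshold.

    Brute force over all pairs: quadratic in the number of trajectories.
    """
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(r), enumerate(s))
        if discrete_frechet(a, b) < threshold
    ]


if __name__ == "__main__":
    r = [[(0, 0), (1, 0), (2, 0)]]
    s = [[(0, 1), (1, 1), (2, 1)], [(0, 5), (1, 5), (2, 5)]]
    print(similarity_join(r, s, threshold=1.5))
```

The quadratic pair enumeration is what a hashing-based scheme aims to avoid: trajectories that share a hash key become candidate pairs, and only those candidates are checked against the threshold.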

Pages: 360-380

Page count: 21
