Metric Similarity Joins Using MapReduce

Cited by: 22
Authors
Chen, Gang [1, 2]
Yang, Keyu [1]
Chen, Lu [1]
Gao, Yunjun [1, 2]
Zheng, Baihua [3]
Chen, Chun [1]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, 38 Zheda Rd, Hangzhou 310027, Peoples R China
[2] Zhejiang Univ, Key Lab Big Data Intelligent Comp Zhejiang Prov, 38 Zheda Rd, Hangzhou 310027, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Keywords
Similarity joins; metric space; MapReduce; algorithm; QUERIES;
DOI
10.1109/TKDE.2016.2631599
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given two object sets Q and O, a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications, including data cleaning and data mining. However, the rapidly growing volume of data challenges traditional metric similarity join methods, so a distributed method is required. In this paper, we adopt a popular distributed framework, MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling-based partition methods. One uses the pivot and space-filling curve mappings to cluster the data in one-dimensional space, and then selects high-quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to divide the data equally after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where range-object filtering, double-pivot filtering, pivot filtering, and plane sweeping techniques are used for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions significantly outperform existing state-of-the-art competitors.
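To illustrate the pivot mapping and pivot filtering that the abstract mentions, the following is a minimal single-machine sketch. It is not the paper's MapReduce framework, and it omits the partitioning, range-object filtering, double-pivot filtering, and plane sweeping steps. It assumes a distance-threshold join (pairs with distance at most `eps`), uses edit distance as an example metric, and all function and parameter names are hypothetical. The only property relied on is the triangle inequality: for any pivot p, |d(q, p) - d(o, p)| <= d(q, o), so a pair whose pivot distances differ by more than `eps` cannot be a result.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance between two strings (an example metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pivot_map(obj, pivots, dist):
    """Map an object to the vector of its distances to the pivots."""
    return [dist(obj, p) for p in pivots]


def similarity_join(Q, O, pivots, eps, dist):
    """Return (pairs, verified): all (q, o) with dist(q, o) <= eps, and
    the number of pairs that needed a full distance computation.

    Hypothetical helper for illustration only.
    """
    q_vecs = [pivot_map(q, pivots, dist) for q in Q]
    o_vecs = [pivot_map(o, pivots, dist) for o in O]
    pairs, verified = [], 0
    for (q, qv), (o, ov) in product(zip(Q, q_vecs), zip(O, o_vecs)):
        # Pivot filter: by the triangle inequality, a gap larger than
        # eps on any pivot coordinate rules the pair out.
        if any(abs(a - b) > eps for a, b in zip(qv, ov)):
            continue
        verified += 1
        if dist(q, o) <= eps:
            pairs.append((q, o))
    return pairs, verified


if __name__ == "__main__":
    Q = ["kitten", "metric", "join"]
    O = ["mitten", "metrics", "joint", "mapreduce"]
    pairs, verified = similarity_join(Q, O, ["kitten", "join"], 1, edit_distance)
    print(pairs, verified, "of", len(Q) * len(O), "pairs verified")
```

The filter never discards a true result, so the output matches a brute-force join; how many distance computations it saves depends on the pivots chosen and on the data.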
Pages: 656-669
Number of pages: 14
Related papers
50 in total
  • [21] An efficient MapReduce algorithm for similarity join in metric spaces
    Liu, Wen
    Shen, Yanming
    Wang, Peng
    JOURNAL OF SUPERCOMPUTING, 2016, 72 (03): : 1179 - 1200
  • [22] Secure Joins with MapReduce
    Bultel, Xavier
    Ciucanu, Radu
    Giraud, Matthieu
    Lafourcade, Pascal
    Ye, Lihua
    FOUNDATIONS AND PRACTICE OF SECURITY, FPS 2018, 2019, 11358 : 78 - 94
  • [23] On Spatial Joins in MapReduce
    Sabek, Ibrahim
    Mokbel, Mohamed F.
    25TH ACM SIGSPATIAL INTERNATIONAL CONFERENCE ON ADVANCES IN GEOGRAPHIC INFORMATION SYSTEMS (ACM SIGSPATIAL GIS 2017), 2017,
  • [24] Efficient processing distributed joins with bloomfilter using MapReduce
    School of Computer Science and Technology, University of Science and Technology of China, Hefei, 230026, China
    Li, J. (lj@ustc.edu.cn), 1600, Science and Engineering Research Support Society, 20 Virginia Court, Sandy Bay, Tasmania, Australia (06):
  • [25] Efficient Processing Distributed Joins with Bloomfilter using MapReduce
    Zhang, Changchun
    Wu, Lei
    Li, Jing
    INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2013, 6 (03): : 43 - 57
  • [26] Optimizing Distributed Joins with Bloom Filters Using MapReduce
    Zhang, Changchun
    Wu, Lei
    Li, Jing
    COMPUTER APPLICATIONS FOR GRAPHICS, GRID COMPUTING, AND INDUSTRIAL ENVIRONMENT, 2012, 351 : 88 - 95
  • [27] An efficient algorithm for approximated self-similarity joins in metric spaces
    Ferrada, Sebastian
    Bustos, Benjamin
    Reyes, Nora
    INFORMATION SYSTEMS, 2020, 91
  • [28] MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data
    Wang, Jingjing
    Lin, Chen
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2015, 2015
  • [29] List of twin clusters: a data structure for similarity joins in metric spaces
    Paredes, Rodrigo
    Reyes, Nora
    2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOP, VOLS 1 AND 2, 2008, : 578 - +
  • [30] List of twin clusters: a data structure for similarity joins in metric spaces
    Paredes, Rodrigo
    Reyes, Nora
    SISAP 2008: FIRST INTERNATIONAL WORKSHOP ON SIMILARITY SEARCH AND APPLICATIONS, PROCEEDINGS, 2008, : 131 - 138