Metric Similarity Joins Using MapReduce

Cited by: 22
Authors
Chen, Gang [1 ,2 ]
Yang, Keyu [1 ]
Chen, Lu [1 ]
Gao, Yunjun [1 ,2 ]
Zheng, Baihua [3 ]
Chen, Chun [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, 38 Zheda Rd, Hangzhou 310027, Peoples R China
[2] Zhejiang Univ, Key Lab Big Data Intelligent Comp Zhejiang Prov, 38 Zheda Rd, Hangzhou 310027, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Keywords
Similarity joins; metric space; MapReduce; algorithm; QUERIES;
DOI
10.1109/TKDE.2016.2631599
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given two object sets Q and O, a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications in data cleaning and data mining, to name but a few. However, the rapidly growing volume of data nowadays challenges traditional metric similarity join methods, and thus a distributed method is required. In this paper, we adopt a popular distributed framework, namely MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling-based partition methods. One utilizes the pivot and the space-filling curve mappings to cluster the data into one-dimensional space, and then selects high-quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to equally divide the data after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where the range-object filtering, the double-pivot filtering, the pivot filtering, and the plane sweeping techniques are utilized for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions significantly outperform existing state-of-the-art competitors.
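The pivot mapping and pivot filtering mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's MapReduce framework: it only shows the general triangle-inequality idea that, for any pivot p, d(q, o) >= |d(q, p) - d(o, p)|, so a pair whose mapped coordinates differ by more than the threshold on some pivot can be skipped without computing d(q, o). The function names (`pivot_map`, `can_prune`, `similarity_join`), the Euclidean distance, the threshold `eps`, and the sample points are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, used here as an example metric


def pivot_map(obj, pivots, d):
    """Map an object to its vector of distances to the pivots."""
    return tuple(d(obj, p) for p in pivots)


def can_prune(mapped_q, mapped_o, eps):
    """Pivot filter: by the triangle inequality, d(q, o) >= |d(q, p) - d(o, p)|
    for every pivot p, so one coordinate gap above eps rules the pair out."""
    return any(abs(a - b) > eps for a, b in zip(mapped_q, mapped_o))


def similarity_join(Q, O, pivots, eps, d=dist):
    """Return all pairs (q, o) with d(q, o) <= eps, plus the number of
    pairs whose exact distance had to be computed after filtering."""
    mapped_O = [(o, pivot_map(o, pivots, d)) for o in O]
    result, verified = [], 0
    for q in Q:
        mapped_q = pivot_map(q, pivots, d)
        for o, mapped_o in mapped_O:
            if can_prune(mapped_q, mapped_o, eps):
                continue  # safely skipped: the pair cannot be within eps
            verified += 1
            if d(q, o) <= eps:
                result.append((q, o))
    return result, verified


if __name__ == "__main__":
    Q = [(0, 0), (5, 5)]
    O = [(0, 1), (5, 6), (10, 10)]
    pairs, verified = similarity_join(Q, O, pivots=[(0, 0), (10, 0)], eps=1.5)
    print(pairs, verified)
```

The filter never discards a true result, because it relies only on the triangle inequality; it can only reduce the number of exact distance computations. How much it reduces them depends on the choice of pivots, which this sketch does not address.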
Pages: 656-669
Number of pages: 14
Related papers
50 in total
  • [31] Multidimensional Similarity Join Using MapReduce
    Li, Ye
    Wang, Jian
    Hou, Leong U.
    WEB-AGE INFORMATION MANAGEMENT, PT II, 2016, 9659 : 457 - 468
  • [32] Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data
    Cech, Premysl
    Marousek, Jakub
    Lokoc, Jakub
    Silva, Yasin N.
    Starks, Jeremy
    ADVANCED DATA MINING AND APPLICATIONS, ADMA 2017, 2017, 10604 : 63 - 75
  • [33] Efficient Processing of k Nearest Neighbor Joins using MapReduce
    Lu, Wei
    Shen, Yanyan
    Chen, Su
    Ooi, Beng Chin
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (10): : 1016 - 1027
  • [34] Solving similarity joins and range queries in metric spaces with the list of twin clusters
    Paredes, Rodrigo
    Reyes, Nora
    JOURNAL OF DISCRETE ALGORITHMS, 2009, 7 (01) : 18 - 35
  • [35] Accelerating Set Similarity Joins Using GPUs
    Cruz, Mateus S. H.
    Kozawa, Yusuke
    Amagasa, Toshiyuki
    Kitagawa, Hiroyuki
    TRANSACTIONS ON LARGE-SCALE DATA- AND KNOWLEDGE-CENTERED SYSTEMS XXVIII: SPECIAL ISSUE ON DATABASE- AND EXPERT-SYSTEMS APPLICATIONS, 2016, 9940 : 1 - 22
  • [36] XML Structural Similarity Search Using MapReduce
    Yuan, Peisen
    Sha, Chaofeng
    Wang, Xiaoling
    Yang, Bin
    Zhou, Aoying
    Yang, Su
    WEB-AGE INFORMATION MANAGEMENT, PROCEEDINGS, 2010, 6184 : 169 - +
  • [37] Detecting Text Similarity Using MapReduce Framework
    Birjali, Marouane
    Beni-Hssane, Abderrahim
    Erritali, Mohammed
    Madani, Youness
    EUROPE AND MENA COOPERATION ADVANCES IN INFORMATION AND COMMUNICATION TECHNOLOGIES, 2017, 520 : 383 - 389
  • [38] Fuzzy Joins in MapReduce: An Experimental Study
    Kimmett, Ben
    Srinivasan, Venkatesh
    Thomo, Alex
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2015, 8 (12): : 1514 - 1517
  • [39] Application of Filters to Multiway Joins in MapReduce
    Lee, Taewhi
    Im, Dong-Hyuk
    Kim, Hangkyu
    Kim, Hyoung-Joo
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2014, 2014
  • [40] Parallel Computation of k-Nearest Neighbor Joins Using MapReduce
    Kim, Wooyeol
    Kim, Younghoon
    Shim, Kyuseok
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 696 - 705