Metric Similarity Joins Using MapReduce

Cited by: 22
Authors
Chen, Gang [1, 2]
Yang, Keyu [1]
Chen, Lu [1]
Gao, Yunjun [1, 2]
Zheng, Baihua [3]
Chen, Chun [1]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, 38 Zheda Rd, Hangzhou 310027, Peoples R China
[2] Zhejiang Univ, Key Lab Big Data Intelligent Comp Zhejiang Prov, 38 Zheda Rd, Hangzhou 310027, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Keywords
Similarity joins; metric space; MapReduce; algorithm; QUERIES;
DOI
10.1109/TKDE.2016.2631599
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given two object sets Q and O, a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications, including data cleaning and data mining. However, the rapidly growing volume of data challenges traditional metric similarity join methods, so a distributed method is required. In this paper, we adopt a popular distributed framework, MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling-based partition methods. One uses the pivot and space-filling curve mappings to cluster the data in one-dimensional space, and then selects high-quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to divide the data equally after the pivot mapping. To avoid unnecessary object pair evaluations, we propose a framework that maps the two involved object sets in order, where range-object filtering, double-pivot filtering, pivot filtering, and plane sweeping techniques are used for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions significantly outperform existing state-of-the-art competitors.
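The pivot mapping and pivot filtering mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's MapReduce algorithm: it assumes the join criterion is a distance threshold `eps`, uses Euclidean distance as the metric, and takes a hand-picked pivot list. The function names (`pivot_map`, `can_prune`, `similarity_join`) are hypothetical and chosen only for illustration. The idea is that each object is mapped to its vector of distances to the pivots, and the triangle inequality then allows a pair to be discarded without computing its actual distance.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance; any metric satisfying the triangle inequality works."""
    return math.dist(a, b)


def pivot_map(obj, pivots, dist):
    """Map an object to its vector of distances to each pivot."""
    return tuple(dist(obj, p) for p in pivots)


def can_prune(q_vec, o_vec, eps):
    """Pivot filtering: by the triangle inequality,
    |d(q, p) - d(o, p)| <= d(q, o) for every pivot p.
    If that lower bound exceeds eps for any pivot, the pair cannot qualify."""
    return any(abs(dq - do) > eps for dq, do in zip(q_vec, o_vec))


def similarity_join(Q, O, pivots, eps, dist=euclidean):
    """Return (pairs within eps, number of pairs whose distance was computed).
    Assumption: the join criterion is a distance threshold eps."""
    q_mapped = [(q, pivot_map(q, pivots, dist)) for q in Q]
    o_mapped = [(o, pivot_map(o, pivots, dist)) for o in O]
    pairs = []
    verified = 0
    for (q, q_vec), (o, o_vec) in product(q_mapped, o_mapped):
        if can_prune(q_vec, o_vec, eps):
            continue  # discarded without an exact distance computation
        verified += 1
        if dist(q, o) <= eps:
            pairs.append((q, o))
    return pairs, verified


if __name__ == "__main__":
    Q = [(0, 0), (5, 5)]
    O = [(0, 1), (9, 9)]
    pairs, verified = similarity_join(Q, O, pivots=[(0, 0)], eps=1.5)
    print(pairs, verified)
```

In this sketch the filter only skips pairs that cannot be within `eps`, so the result matches a brute-force join while fewer exact distances are computed. How pivots are selected and how the work is partitioned across reducers are left out here.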
Pages: 656 - 669
Page count: 14
Related papers
50 in total
  • [41] SEJ: An Even Approach to Multiway Theta-Joins using MapReduce
    Zhang, Changchun
    Li, Jing
    Wu, Lei
    Lin, Meiyan
    Liu, Weiqing
    SECOND INTERNATIONAL CONFERENCE ON CLOUD AND GREEN COMPUTING / SECOND INTERNATIONAL CONFERENCE ON SOCIAL COMPUTING AND ITS APPLICATIONS (CGC/SCA 2012), 2012, : 73 - 80
  • [42] Compact similarity joins
    Bryan, Brent
    Eberhardt, Frederick
    Faloutsos, Christos
    2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 346 - +
  • [43] Diversity in Similarity Joins
    Santos, Lucio F. D.
    Carvalho, Luiz Olmes
    Oliveira, Willian D.
    Traina, Agma J. M.
    Traina, Caetano, Jr.
    SIMILARITY SEARCH AND APPLICATIONS, SISAP 2015, 2015, 9371 : 42 - 53
  • [44] Secure Similarity Joins Using Fully Homomorphic Encryption
    Cruz, Mateus S. H.
    Amagasa, Toshiyuki
    Watanabe, Chiemi
    Lu, Wenjie
    Kitagawa, Hiroyuki
    19TH INTERNATIONAL CONFERENCE ON INFORMATION INTEGRATION AND WEB-BASED APPLICATIONS & SERVICES (IIWAS2017), 2017, : 224 - 233
  • [45] Sentiment analysis using semantic similarity and Hadoop MapReduce
    Youness Madani
    Mohammed Erritali
    Jamaa Bengourram
    Knowledge and Information Systems, 2019, 59 : 413 - 436
  • [46] Sentiment analysis using semantic similarity and Hadoop MapReduce
    Madani, Youness
    Erritali, Mohammed
    Bengourram, Jamaa
    KNOWLEDGE AND INFORMATION SYSTEMS, 2019, 59 (02) : 413 - 436
  • [47] V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors
    Metwally, Ahmed
    Faloutsos, Christos
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (08): : 704 - 715
  • [48] SharesSkew: An algorithm to handle skew for joins in MapReduce
    Afrati, Foto N.
    Stasinopoulos, Nikos
    Ullman, Jeffrey D.
    Vassilakopoulos, Angelos
    INFORMATION SYSTEMS, 2018, 77 : 129 - 150
  • [49] Optimizing Theta-Joins in a MapReduce Environment
    School of Computer Science and Technology, University of Science and Technology of China, Hefei, 230026, China
    Int. J. Database Theory Appl., 4 (91-108):
  • [50] kNN-DP: Handling Data Skewness in kNN Joins Using MapReduce
    Zhao, Xujun
    Zhang, Jifu
    Qin, Xiao
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2018, 29 (03) : 600 - 613