Fast and scalable vector similarity joins with MapReduce

Authors
Byoungju Yang
Hyun Joon Kim
Junho Shim
Dongjoo Lee
Sang-goo Lee
Affiliations
[1] Oracle Corporation, Database Server Technology
[2] Seoul National University, School of Computer Science and Engineering
[3] Sookmyung Women’s University, Department of Computer Science
[4] Samsung Electronics Co., Software R&D Center
Keywords
Similarity join; MapReduce; Cosine similarity; Filtering
Abstract
Vector similarity join, which finds similar pairs of vector objects, is a computationally expensive process. As the number of vectors increases, the time needed for the join operation grows in proportion to the square of the number of vectors. Various filtering techniques have been proposed to reduce this computational load. Separately, MapReduce algorithms have been studied for managing large datasets. Recent improvements, however, still suffer from long computation times and limited scalability. In this paper, we propose a MapReduce algorithm, FACET (FAst and sCalable maprEduce similariTy join), to efficiently solve the vector similarity join problem on large datasets. FACET is an all-pair exact join algorithm composed of two stages. In the first stage, we use our own novel filtering techniques to eliminate dissimilar pairs and generate non-redundant candidate pairs. The second stage matches candidate pairs with the vector data so that similar pairs are produced as the output. Both stages employ the parallelism offered by MapReduce. The algorithm is currently designed for cosine similarity and the self-join case. Extensions to other similarity measures and the R-S join case are also discussed. We provide an I/O analysis of the algorithm and evaluate its performance on multiple real-world datasets. The experimental results show that our algorithm performs, on average, 40% to 800% better than the previous state-of-the-art MapReduce algorithms.
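The two-stage shape described in the abstract (generate candidate pairs, then verify them against the vector data) can be illustrated with a minimal single-machine sketch. This is not FACET: the filter below is a generic inverted-index filter that keeps only pairs sharing at least one dimension, standing in for the paper's own filtering techniques, and no MapReduce framework is used. All function names are hypothetical, vectors are assumed to be non-zero sparse dictionaries, and the threshold is assumed to be greater than 0 so that the filter does not drop any qualifying pair.

```python
import math
from collections import defaultdict
from itertools import combinations


def normalize(vec):
    """Scale a sparse vector (dim -> weight) to unit length; assumes it is non-zero."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {dim: w / norm for dim, w in vec.items()}


def candidate_pairs(vectors):
    """Stage 1 (illustrative filter, not the paper's): keep only pairs that
    share at least one dimension, found through an inverted index."""
    index = defaultdict(list)
    for vid, vec in vectors.items():
        for dim in vec:
            index[dim].append(vid)
    pairs = set()  # a set removes pairs found through several dimensions
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def verify(vectors, pairs, threshold):
    """Stage 2: compute the exact cosine similarity of each candidate pair
    (a dot product, since the vectors are unit length) and keep the similar ones."""
    result = {}
    for a, b in pairs:
        va, vb = vectors[a], vectors[b]
        if len(va) > len(vb):
            va, vb = vb, va  # iterate over the shorter vector
        sim = sum(w * vb.get(dim, 0.0) for dim, w in va.items())
        if sim >= threshold:
            result[(a, b)] = sim
    return result


def similarity_self_join(raw_vectors, threshold):
    """Self-join: all pairs whose cosine similarity is at least threshold (> 0)."""
    vectors = {vid: normalize(vec) for vid, vec in raw_vectors.items()}
    return verify(vectors, candidate_pairs(vectors), threshold)
```

In a MapReduce setting, each stage would typically be expressed as map and reduce functions keyed by dimension or by vector identifier; the sketch keeps both stages in memory only to make the candidate-then-verify structure visible.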
Pages: 473 - 497
Page count: 24
Related papers
50 in total
  • [1] Fast and scalable vector similarity joins with MapReduce
    Yang, Byoungju
    Kim, Hyun Joon
    Shim, Junho
    Lee, Dongjoo
    Lee, Sang-goo
    [J]. JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2016, 46 (03) : 473 - 497
  • [2] Efficient and Scalable Graph Similarity Joins in MapReduce
    Chen, Yifan
    Zhao, Xiang
    Xiao, Chuan
    Zhang, Weiming
    Tang, Jiuyang
    [J]. SCIENTIFIC WORLD JOURNAL, 2014
  • [3] Practising Scalable Graph Similarity Joins in MapReduce
    Chen, Yifan
    Zhao, Xiang
    Ge, Bin
    Xiao, Chuan
    Chi, Chi-Hung
    [J]. 2014 IEEE INTERNATIONAL CONGRESS ON BIG DATA (BIGDATA CONGRESS), 2014 : 112 - 119
  • [4] MassJoin: A MapReduce-based Method for Scalable String Similarity Joins
    Deng, Dong
    Li, Guoliang
    Hao, Shuang
    Wang, Jiannan
    Feng, Jianhua
    [J]. 2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2014 : 340 - 351
  • [5] Metric Similarity Joins Using MapReduce
    Chen, Gang
    Yang, Keyu
    Chen, Lu
    Gao, Yunjun
    Zheng, Baihua
    Chen, Chun
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (03) : 656 - 669
  • [6] Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data
    Rozinek, Ondrej
    Borkovcova, Monika
    Mares, Jan
    [J]. GOOD PRACTICES AND NEW PERSPECTIVES IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 6, WORLDCIST 2024, 2024, 990 : 181 - 191
  • [7] Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics
    Rong, Chuitian
    Lin, Chunbin
    Silva, Yasin N.
    Wang, Jianguo
    Lu, Wei
    Du, Xiaoyong
    [J]. 2017 IEEE 33RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2017), 2017 : 1059 - 1070
  • [8] Set Similarity Joins on MapReduce: An Experimental Survey
    Fier, Fabian
    Augsten, Nikolaus
    Bouros, Panagiotis
    Leser, Ulf
    Freytag, Johann-Christoph
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 11 (10) : 1110 - 1122
  • [9] Privacy preserving similarity joins using MapReduce
    Ding, Xiaofeng
    Yang, Wanlu
    Choo, Kim-Kwang Raymond
    Wang, Xiaoli
    Jin, Hai
    [J]. INFORMATION SCIENCES, 2019, 493 : 20 - 33
  • [10] Scalable Similarity Joins of Tokenized Strings
    Metwally, Ahmed
    Huang, Chun-Heng
    [J]. 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019 : 1766 - 1777