Strategic and suave processing for performing similarity joins using MapReduce

Cited by: 1
Authors
Lakshminarayanan, Mahalakshmi [1 ]
Acosta, William F. [2 ]
Green, Robert C., II [3 ]
Devabhaktuni, Vijay [1 ]
Affiliations
[1] Univ Toledo, Toledo, OH 43606 USA
[2] Harman Int, Vernon Hills, IL 60061 USA
[3] Bowling Green State Univ, Dept Comp Sci, Bowling Green, OH 43403 USA
Source
JOURNAL OF SUPERCOMPUTING | 2014 / Vol. 69 / Issue 02
Keywords
Similarity Joins; Multisets; MapReduce;
DOI
10.1007/s11227-014-1197-7
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
An efficient MapReduce algorithm for performing similarity joins between multisets is proposed. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence improve the efficiency of the algorithm. Multisets represent real-world data better than sets because they account for the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either incorporate no filtering technique or incorporate prefix filtering in an inefficient and unscalable way. This work extends the prefix, size, and positional filtering techniques to multisets and achieves the challenging task of incorporating them efficiently in the shared-nothing MapReduce model, thereby minimizing the pairs generated and joined and improving I/O, network, and computational efficiency. A technique to enhance the scalability of the algorithm is also presented as a contingency measure. The algorithms are implemented using Hadoop and tested on real-world Twitter data. Experimental results demonstrate unprecedented performance gain.
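To make the idea of filtering before joining concrete, the following is a minimal single-machine sketch in Python. It is not the paper's MapReduce algorithm: it assumes the multiset Jaccard measure (sum of element-wise minimum counts over sum of element-wise maximum counts) and illustrates only a size filter, which relies on the bound that this similarity can never exceed the ratio of the smaller multiset's total size to the larger one's. The function names `multiset_jaccard`, `passes_size_filter`, and `similarity_join` are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def multiset_jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard: sum of min counts divided by sum of max counts."""
    inter = sum((a & b).values())  # Counter & keeps the minimum count per element
    union = sum((a | b).values())  # Counter | keeps the maximum count per element
    return inter / union if union else 1.0  # assumption: two empty multisets are identical


def passes_size_filter(a: Counter, b: Counter, threshold: float) -> bool:
    """Cheap pre-check: similarity <= smaller size / larger size, so prune if that is too low."""
    size_a, size_b = sum(a.values()), sum(b.values())
    small, large = min(size_a, size_b), max(size_a, size_b)
    return large == 0 or small >= threshold * large


def similarity_join(records: dict, threshold: float) -> list:
    """Self-join: return (id_a, id_b, similarity) for pairs at or above the threshold."""
    results = []
    for (id_a, a), (id_b, b) in combinations(records.items(), 2):
        if not passes_size_filter(a, b, threshold):
            continue  # pruned without computing the full similarity
        sim = multiset_jaccard(a, b)
        if sim >= threshold:
            results.append((id_a, id_b, sim))
    return results


if __name__ == "__main__":
    # Hypothetical toy records: token multisets keyed by record id.
    toy_records = {
        "t1": Counter("a a b c".split()),
        "t2": Counter("a b c".split()),
        "t3": Counter("x".split()),
    }
    print(similarity_join(toy_records, threshold=0.7))
```

In a distributed setting the same pre-check would have to run before candidate pairs are shipped between nodes, which is where the savings in I/O and network traffic described in the abstract would come from; how the paper arranges this in MapReduce is not shown here.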
Pages: 930 / 954
Number of pages: 25
Related papers
50 in total
  • [21] Optimizing Distributed Joins with Bloom Filters Using MapReduce
    Zhang, Changchun
    Wu, Lei
    Li, Jing
    [J]. COMPUTER APPLICATIONS FOR GRAPHICS, GRID COMPUTING, AND INDUSTRIAL ENVIRONMENT, 2012, 351 : 88 - 95
  • [22] MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data
    Wang, Jingjing
    Lin, Chen
    [J]. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2015, 2015
  • [23] Multidimensional Similarity Join Using MapReduce
    Li, Ye
    Wang, Jian
    Hou, Leong U.
    [J]. WEB-AGE INFORMATION MANAGEMENT, PT II, 2016, 9659 : 457 - 468
  • [24] Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data
    Cech, Premysl
    Marousek, Jakub
    Lokoc, Jakub
    Silva, Yasin N.
    Starks, Jeremy
    [J]. ADVANCED DATA MINING AND APPLICATIONS, ADMA 2017, 2017, 10604 : 63 - 75
  • [25] Accelerating Set Similarity Joins Using GPUs
    Cruz, Mateus S. H.
    Kozawa, Yusuke
    Amagasa, Toshiyuki
    Kitagawa, Hiroyuki
    [J]. TRANSACTIONS ON LARGE-SCALE DATA- AND KNOWLEDGE-CENTERED SYSTEMS XXVIII: SPECIAL ISSUE ON DATABASE- AND EXPERT-SYSTEMS APPLICATIONS, 2016, 9940 : 1 - 22
  • [26] A Lightweight Indexing Approach for Efficient Batch Similarity Processing with MapReduce
    Phan T.N.
    Dang T.K.
    [J]. SN Computer Science, 2020, 1 (1)
  • [27] XML Structural Similarity Search Using MapReduce
    Yuan, Peisen
    Sha, Chaofeng
    Wang, Xiaoling
    Yang, Bin
    Zhou, Aoying
    Yang, Su
    [J]. WEB-AGE INFORMATION MANAGEMENT, PROCEEDINGS, 2010, 6184 : 169 - +
  • [28] Scalable Metric Similarity Join using MapReduce
    Wu, Jiacheng
    Zhang, Yong
    Wang, Jin
    Lin, Chunbin
    Fu, Yingjia
    Xing, Chunxiao
    [J]. 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 1662 - 1665
  • [29] Detecting Text Similarity Using MapReduce Framework
    Birjali, Marouane
    Beni-Hssane, Abderrahim
    Erritali, Mohammed
    Madani, Youness
    [J]. EUROPE AND MENA COOPERATION ADVANCES IN INFORMATION AND COMMUNICATION TECHNOLOGIES, 2017, 520 : 383 - 389
  • [30] Performing Bayesian Inference using Apache Hadoop MapReduce
    Jongsawat, Nipat
    Premchaiswadi, Wichian
    [J]. INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND SOFTWARE ENGINEERING (AISE 2014), 2014, : 420 - 424