Parallel similarity joins on massive high-dimensional data using MapReduce

Cited by: 24
Authors
Ma, Youzhong [1, 2]
Meng, Xiaofeng [2 ]
Wang, Shaoya [3 ]
Affiliations
[1] Luoyang Normal Univ, Sch Informat & Technol, Luoyang 47102, Peoples R China
[2] Renmin Univ China, Sch Informat, Beijing 100872, Peoples R China
[3] NEC Labs China, Beijing, Peoples R China
Source
Keywords
similarity join; MapReduce; symbolic aggregate approximation; high-dimensional data; piecewise aggregate approximation;
DOI
10.1002/cpe.3663
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
In this paper, we focus on high-dimensional similarity join (HDSJ) using the MapReduce paradigm. As the volume of data and the number of dimensions increase, the computation cost of HDSJ increases exponentially. No existing approach processes HDSJ efficiently, so we propose a novel method, symbolic aggregate approximation (SAX)-based HDSJ, to deal with the problem. SAX is a dimensionality reduction technique widely used in time series processing; in this paper, we use SAX to represent the high-dimensional vectors and reorganize these vectors into groups based on their SAX representations. For very high-dimensional vectors, we also propose an improved SAX-based HDSJ approach. Finally, we implement SAX-based HDSJ and improved SAX-based HDSJ on Hadoop-0.20.2, perform comprehensive experiments to test their performance, and compare both approaches with the existing method. The experimental results show that our proposed approaches perform much better than the existing method. Copyright (c) 2015 John Wiley & Sons, Ltd.
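As a rough illustration of the SAX representation mentioned in the abstract, the following minimal Python sketch reduces a vector with piecewise aggregate approximation (PAA), maps each segment mean to a symbol, and groups vectors that share the same SAX word. It follows the textbook form of SAX (z-normalization, equiprobable Gaussian breakpoints) and is not taken from the paper: the function names `paa`, `sax_word`, and `group_by_sax`, the parameters `segments` and `alphabet_size`, and the choice to z-normalize each vector are all assumptions made for this example.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist, mean, pstdev


def paa(vector, segments):
    """Piecewise aggregate approximation: mean of each equal-width segment."""
    n = len(vector)
    if n % segments != 0:
        # Assumption of this sketch: the dimension count divides evenly.
        raise ValueError("len(vector) must be divisible by segments")
    width = n // segments
    return [mean(vector[i * width:(i + 1) * width]) for i in range(segments)]


def sax_word(vector, segments, alphabet_size):
    """Map a vector to a SAX word of length `segments` (hypothetical helper)."""
    mu, sigma = mean(vector), pstdev(vector)
    # Z-normalize; a constant vector is mapped to all zeros.
    normalized = [(x - mu) / sigma if sigma > 0 else 0.0 for x in vector]
    # Breakpoints that split the standard normal into equiprobable regions.
    breakpoints = [
        NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)
    ]
    # Each PAA value becomes the letter of the region it falls into.
    return "".join(
        chr(ord("a") + bisect_right(breakpoints, value))
        for value in paa(normalized, segments)
    )


def group_by_sax(vectors, segments, alphabet_size):
    """Group vector indices by their SAX word."""
    groups = defaultdict(list)
    for index, vector in enumerate(vectors):
        groups[sax_word(vector, segments, alphabet_size)].append(index)
    return dict(groups)
```

Grouping by identical SAX word is only the simplest possible bucketing. A correct similarity join would also have to compare vectors in nearby groups, since two vectors within the distance threshold can fall on opposite sides of a breakpoint; how the paper handles this is not stated in the abstract.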
Pages: 166 - 183
Number of pages: 18
Related papers
50 in total
  • [1] Similarity joins for high-dimensional data using Spark
    Rong, Chuitian
    Cheng, Xiaohai
    Chen, Ziliang
    Huo, Na
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (20):
  • [2] Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data
    Cech, Premysl
    Marousek, Jakub
    Lokoc, Jakub
    Silva, Yasin N.
    Starks, Jeremy
    [J]. ADVANCED DATA MINING AND APPLICATIONS, ADMA 2017, 2017, 10604 : 63 - 75
  • [3] High-dimensional similarity joins
    Shim, K
    Srikant, R
    Agrawal, R
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2002, 14 (01) : 156 - 171
  • [4] High-dimensional similarity joins
    Shim, K
    Srikant, R
    Agrawal, R
    [J]. 13TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING - PROCEEDINGS, 1997, : 301 - 311
  • [5] PHiDJ: Parallel Similarity Self-Join for High-Dimensional Vector Data with MapReduce
    Fries, Sergej
    Boden, Brigitte
    Stepien, Grzegorz
    Seidl, Thomas
    [J]. 2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2014, : 796 - 807
  • [6] Parallel algorithms for high-dimensional proximity joins
    Shafer, JC
    Agrawal, R
    [J]. PROCEEDINGS OF THE TWENTY-THIRD INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, 1997, : 176 - 185
  • [7] Projection Based Large Scale High-Dimensional Data Similarity Join Using MapReduce Framework
    Ma, Youzhong
    Zhang, Ruiling
    Cui, Zhanyou
    Lin, Chunjie
    [J]. IEEE ACCESS, 2020, 8 : 121665 - 121677
  • [8] Metric Similarity Joins Using MapReduce
    Chen, Gang
    Yang, Keyu
    Chen, Lu
    Gao, Yunjun
    Zheng, Baihua
    Chen, Chun
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (03) : 656 - 669
  • [9] SCEA: A Parallel Clustering Ensemble Algorithm for High-Dimensional Massive Data
    Liao, Bin
    Huang, Jing-Lai
    Wang, Xin
    Sun, Rui-Na
    Ge, Xiao-Yan
    Guo, Bing-Lei
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2021, 49 (06) : 1077 - 1087
  • [10] Privacy preserving similarity joins using MapReduce
    Ding, Xiaofeng
    Yang, Wanlu
    Choo, Kim-Kwang Raymond
    Wang, Xiaoli
    Jin, Hai
    [J]. INFORMATION SCIENCES, 2019, 493 : 20 - 33