Parallel similarity joins on massive high-dimensional data using MapReduce

Cited by: 24

Authors:

Ma, Youzhong [1,2]

Meng, Xiaofeng [2]

Wang, Shaoya [3]

Affiliations:

[1] Luoyang Normal Univ, Sch Informat & Technol, Luoyang 47102, Peoples R China

[2] Renmin Univ China, Sch Informat, Beijing 100872, Peoples R China

[3] NEC Labs China, Beijing, Peoples R China

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2016 / Volume 28 / Issue 01

Keywords:

similarity join; MapReduce; symbolic aggregate approximation; high-dimensional data; piecewise aggregate approximation;

DOI:

10.1002/cpe.3663

Chinese Library Classification number:

TP31 [Computer Software];

Discipline classification codes:

081202 ; 0835 ;

Abstract:

In this paper, we focus on high-dimensional similarity join (HDSJ) using the MapReduce paradigm. As the volume of the data and the number of dimensions increase, the computation cost of HDSJ increases exponentially. No existing approach can process HDSJ efficiently, so we propose a novel method called symbolic aggregate approximation (SAX)-based HDSJ to deal with the problem. SAX is a dimensionality reduction technique that is widely used in time series processing. In this paper, we use SAX to represent the high-dimensional vectors and reorganize these vectors into groups based on their SAX representations. For very high-dimensional vectors, we also propose an improved SAX-based HDSJ approach. Finally, we implement SAX-based HDSJ and improved SAX-based HDSJ on Hadoop-0.20.2 and perform comprehensive experiments to test their performance. We also compare both approaches with the existing method. The experimental results show that our proposed approaches perform much better than the existing method. Copyright (c) 2015 John Wiley & Sons, Ltd.
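The abstract names two building blocks, piecewise aggregate approximation (PAA) and SAX, and says vectors are grouped by their SAX representations. The following is a minimal single-machine sketch of that idea in Python, using the textbook definitions of PAA and SAX. It is not the authors' MapReduce algorithm: the function names (`paa`, `breakpoints`, `sax_word`, `group_by_sax`), the per-vector z-normalization, and the requirement that the vector length be divisible by the segment count are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist, mean, pstdev


def paa(vector, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    n = len(vector)
    if n % segments != 0:
        # Assumption of this sketch; not a restriction stated in the abstract.
        raise ValueError("len(vector) must be divisible by segments")
    width = n // segments
    return [mean(vector[i:i + width]) for i in range(0, n, width)]


def breakpoints(alphabet_size):
    """Cut points splitting the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax_word(vector, segments, alphabet_size):
    """Z-normalize, reduce with PAA, then map each segment mean to a letter."""
    mu, sigma = mean(vector), pstdev(vector)
    normalized = [(x - mu) / sigma if sigma > 0 else 0.0 for x in vector]
    cuts = breakpoints(alphabet_size)
    return "".join(
        chr(ord("a") + bisect_right(cuts, value))
        for value in paa(normalized, segments)
    )


def group_by_sax(vectors, segments, alphabet_size):
    """Hypothetical grouping step: bucket vector indices by their SAX word."""
    groups = defaultdict(list)
    for index, vector in enumerate(vectors):
        groups[sax_word(vector, segments, alphabet_size)].append(index)
    return dict(groups)


if __name__ == "__main__":
    data = [
        [1, 2, 3, 4, 5, 6, 7, 8],
        [2, 4, 6, 8, 10, 12, 14, 16],
        [8, 7, 6, 5, 4, 3, 2, 1],
    ]
    print(group_by_sax(data, segments=4, alphabet_size=4))
```

In a MapReduce setting the SAX word could plausibly serve as the shuffle key, so that vectors in the same group reach the same reducer. Grouping on identical words alone may miss similar pairs that fall into neighbouring groups; how the paper handles that case is not described in the abstract.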

Pages: 166-183

Page count: 18
