Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data

Cited by: 0

Authors:

Böhm, C ^{[1]}

Braunmüller, B ^{[1]}

Krebs, F ^{[1]}

Kriegel, HP ^{[1]}

Affiliations:

[1] Univ Munich, Inst Comp Sci, D-80538 Munich, Germany

Source:

SIGMOD RECORD | 2001 / Vol. 30 / No. 2

Keywords:

similarity join; high-dimensional space; data mining; knowledge discovery; similarity search; feature transformation;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equi-distant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strategy during the join phase. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
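The sort order described in the abstract can be illustrated with a small in-memory sketch. This is not the paper's implementation: it omits the external sorting and the scheduling strategy entirely, and it assumes Euclidean distance (the abstract only says "distance"). The function names `grid_cell`, `ego_sorted`, and `similarity_join` are hypothetical. The sketch relies on one observation: if two points are within ε of each other, every coordinate differs by at most ε, so the partner's grid cell must lie lexicographically between the cells of p − (ε, …, ε) and p + (ε, …, ε).

```python
import bisect
import math


def grid_cell(point, eps):
    """Integer coordinates of the grid cell (cell length eps) containing point."""
    return tuple(math.floor(x / eps) for x in point)


def ego_sorted(points, eps):
    """Sort points by comparing their grid cells lexicographically."""
    return sorted(points, key=lambda p: grid_cell(p, eps))


def similarity_join(r_points, s_points, eps):
    """Return all pairs (r, s) with Euclidean distance <= eps.

    Assumption: both sets fit in memory; the paper's external sort and
    join scheduling are deliberately left out of this sketch.
    """
    s_sorted = ego_sorted(s_points, eps)
    s_cells = [grid_cell(s, eps) for s in s_sorted]
    result = []
    for r in ego_sorted(r_points, eps):
        # Any partner of r has a cell between these two bounds in the
        # lexicographic order, so only that slice of S is inspected.
        lo_cell = grid_cell([x - eps for x in r], eps)
        hi_cell = grid_cell([x + eps for x in r], eps)
        lo = bisect.bisect_left(s_cells, lo_cell)
        hi = bisect.bisect_right(s_cells, hi_cell)
        for s in s_sorted[lo:hi]:
            if math.dist(r, s) <= eps:
                result.append((r, s))
    return result
```

Calling `similarity_join(R, S, 0.1)` on two lists of coordinate tuples returns the qualifying pairs. The lexicographic interval is a coarse filter (it is tight only in the first dimension), which is why each candidate is still checked with an exact distance computation.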


Pages: 379-388

Page count: 10

