High-dimensional similarity joins

Cited by: 12

Authors:

Shim, K

Srikant, R

Agrawal, R

Affiliations:

[1] Korea Adv Inst Sci & Technol, Yusong Gu, Taejon 305701, South Korea

[2] Adv Informat Technol Res Ctr, Yusong Gu, Taejon 305701, South Korea

[3] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2002 / Vol. 14 / No. 01

Keywords:

data mining; similar time sequences; similarity join;

DOI:

10.1109/69.979979

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence, the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the ε tree and the R-tree family, and show that the ε tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life data sets, shows that similarity join using the ε tree is between two times and an order of magnitude faster than the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the ε tree.
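
As a rough illustration of what the abstract means by a similarity join, the sketch below finds all pairs of points whose Euclidean distance is at most a threshold. It is not the index structure from the paper. The function names `brute_force_join` and `stripe_join`, the parameter `eps`, and the choice of Euclidean distance are assumptions made for this example. The second function shows only one simplified idea, restricting comparisons to neighboring partitions: it cuts the first dimension into stripes of width `eps`, so a point can only match points in its own stripe or an adjacent one.

```python
import math
from collections import defaultdict
from itertools import combinations


def brute_force_join(points, eps):
    """Return index pairs (i, j), i < j, with Euclidean distance <= eps.

    Compares every pair, so the cost grows quadratically with len(points).
    """
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    }


def stripe_join(points, eps):
    """Same result as brute_force_join, with fewer distance computations.

    Illustrative assumption: partition on the first coordinate only.
    Two points within eps of each other differ by at most eps in every
    coordinate, so their stripe numbers differ by at most one.
    """
    stripes = defaultdict(list)
    for i, p in enumerate(points):
        stripes[math.floor(p[0] / eps)].append(i)

    result = set()
    for key, members in stripes.items():
        # Compare against the same stripe and the next one; the previous
        # stripe is covered when it is visited as the current stripe.
        candidates = members + stripes.get(key + 1, [])
        for a in members:
            for b in candidates:
                if a != b and math.dist(points[a], points[b]) <= eps:
                    result.add((min(a, b), max(a, b)))
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.05, 0.05), (0.5, 0.5), (0.52, 0.49), (0.95, 0.1)]
    print(sorted(stripe_join(pts, 0.1)))
```

Both functions return the same set of index pairs. The stripe version skips pairs that are far apart in the first coordinate, which hints at why limiting the join test to neighboring partitions reduces work.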

Pages: 156-171

Page count: 16

