High-dimensional similarity joins

Cited by: 12
Authors
Shim, K
Srikant, R
Agrawal, R
Affiliations
[1] Korea Adv Inst Sci & Technol, Yusong Gu, Taejon 305701, South Korea
[2] Adv Informat Technol Res Ctr, Yusong Gu, Taejon 305701, South Korea
[3] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
Keywords
data mining; similar time sequences; similarity join
DOI
10.1109/69.979979
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence, the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the ε tree and the R-tree family, and show that the ε tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life data sets, shows that similarity join using the ε tree is between two times and an order of magnitude faster than the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the ε tree.
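For readers unfamiliar with the operation, a similarity join reports every pair of points whose distance is at most a given threshold. The following is a minimal sketch of that operation only; it is not the index structure proposed in the paper. The function name `similarity_join`, the use of Euclidean distance, and the single-dimension striping used for pruning are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def similarity_join(points, eps):
    """Return sorted index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    Illustrative only: points are bucketed into eps-wide stripes along the
    first coordinate, so each stripe is compared with itself and with one
    neighbouring stripe instead of with every other point. Requires eps > 0.
    """
    stripes = defaultdict(list)
    for idx, p in enumerate(points):
        stripes[math.floor(p[0] / eps)].append(idx)

    result = []
    for key, members in stripes.items():
        # Candidate pairs that fall inside the same stripe.
        for i, j in combinations(members, 2):
            if math.dist(points[i], points[j]) <= eps:
                result.append((i, j))
        # Candidate pairs across this stripe and the next one. Stripes that
        # are two or more apart differ by more than eps in coordinate 0,
        # so they cannot contain a qualifying pair.
        for i in members:
            for j in stripes.get(key + 1, []):
                if math.dist(points[i], points[j]) <= eps:
                    result.append((min(i, j), max(i, j)))
    return sorted(result)


pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1), (0.0, 5.0)]
print(similarity_join(pts, 1.0))  # → [(0, 1), (2, 3)]
```

Striping on one coordinate prunes only along that dimension, so in high dimensions most of the work is still the pairwise distance test; reducing that cost is the problem the abstract addresses.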
Pages: 156-171
Page count: 16