Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data

Cited by: 0
Authors
Böhm, C [1 ]
Braunmüller, B [1 ]
Krebs, F [1 ]
Kriegel, HP [1 ]
Affiliations
[1] Univ Munich, Inst Comp Sci, D-80538 Munich, Germany
Keywords
similarity join; high-dimensional space; data mining; knowledge discovery; similarity search; feature transformation;
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equidistant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strategy during the join phase. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
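The sort order described in the abstract can be illustrated with a small in-memory sketch. It is not the authors' implementation: the external sorting step and the join-phase scheduling strategy are not reproduced, and the pruning rule (stop scanning once the first grid coordinate is more than one cell ahead) is a simplification chosen for this example. The function names `grid_cell`, `epsilon_grid_sort` and `similarity_self_join` are hypothetical.

```python
import math


def grid_cell(point, eps):
    """Integer coordinates of the grid cell (cell length eps) containing point."""
    return tuple(math.floor(x / eps) for x in point)


def epsilon_grid_sort(points, eps):
    """Sort points by comparing their grid cells lexicographically."""
    return sorted(points, key=lambda p: grid_cell(p, eps))


def similarity_self_join(points, eps):
    """Return all pairs (p, q) with Euclidean distance <= eps.

    Simplified, in-memory illustration only: the sorted order is used to
    stop the inner scan early, based on the first grid coordinate.
    """
    ordered = epsilon_grid_sort(points, eps)
    cells = [grid_cell(p, eps) for p in ordered]
    result = []
    for i, p in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            # The first cell coordinate never decreases along the sorted
            # order. Once it is more than one cell ahead of p's cell, this
            # point and all later ones differ from p by more than eps in
            # the first dimension, so none of them can join with p.
            if cells[j][0] > cells[i][0] + 1:
                break
            if math.dist(p, ordered[j]) <= eps:
                result.append((p, ordered[j]))
    return result
```

For example, `similarity_self_join([(0.0, 0.0), (0.5, 0.0), (3.0, 3.0)], 1.0)` reports only the pair formed by the first two points. Pruning on the first dimension alone becomes weak as dimensionality grows, so this sketch shows only the sort order itself, not the performance behaviour evaluated in the paper.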
Pages: 379-388
Number of pages: 10
Related papers
50 in total
  • [1] Progressive high-dimensional similarity join
    Tok, Wee Hyong
    Bressan, Stephane
    Lee, Mong-Li
    [J]. DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2007, 4653 : 233 - +
  • [2] A Δ-tree based similarity join processing for high-dimensional data
    Liu, Yan
    Hao, Zhongxiao
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2009, 46 (06): : 995 - 1002
  • [3] k Nearest Neighbor Similarity Join Algorithm on High-Dimensional Data Using Novel Partitioning Strategy
    Ma, Youzhong
    Hua, Qiaozhi
    Wen, Zheng
    Zhang, Ruiling
    Zhang, Yongxin
    Li, Haipeng
    [J]. SECURITY AND COMMUNICATION NETWORKS, 2022, 2022
  • [4] Parallel similarity joins on massive high-dimensional data using MapReduce
    Ma, Youzhong
    Meng, Xiaofeng
    Wang, Shaoya
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2016, 28 (01): : 166 - 183
  • [5] A KNN-join algorithm based on Δ-tree for high-dimensional data
    Liu, Yan
    Hao, Zhongxiao
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2010, 47 (07): : 1234 - 1243
  • [6] PHiDJ: Parallel Similarity Self-Join for High-Dimensional Vector Data with MapReduce
    Fries, Sergej
    Boden, Brigitte
    Stepien, Grzegorz
    Seidl, Thomas
    [J]. 2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2014, : 796 - 807
  • [7] A novel approach for high-dimensional vector similarity join query
    Ma, Youzhong
    Jia, Shijie
    Zhang, Yongxin
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (05):
  • [8] An efficient parallel algorithm for high dimensional similarity join
    Alsabti, K
    Ranka, S
    Singh, V
    [J]. FIRST MERGED INTERNATIONAL PARALLEL PROCESSING SYMPOSIUM & SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING, 1998, : 556 - 560
  • [9] SCEA: A Parallel Clustering Ensemble Algorithm for High-Dimensional Massive Data
    Liao, Bin
    Huang, Jing-Lai
    Wang, Xin
    Sun, Rui-Na
    Ge, Xiao-Yan
    Guo, Bing-Lei
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2021, 49 (06): : 1077 - 1087
  • [10] Projection Based Large Scale High-Dimensional Data Similarity Join Using MapReduce Framework
    Ma, Youzhong
    Zhang, Ruiling
    Cui, Zhanyou
    Lin, Chunjie
    [J]. IEEE ACCESS, 2020, 8 : 121665 - 121677