Leveraging set relations in exact and dynamic set similarity join

Authors
Xubo Wang
Lu Qin
Xuemin Lin
Ying Zhang
Lijun Chang
Affiliations
[1] University of New South Wales
[2] University of Technology Sydney
[3] The University of Sydney
Source
The VLDB Journal | 2019 / Vol. 28
Keywords
Incremental algorithm; Set similarity join; Set relations
Abstract
Set similarity join, which finds all similar pairs of sets from two collections of sets, is a fundamental problem with a wide range of applications. Existing work studies both exact and approximate set similarity join. In this paper, we focus on the exact set similarity join problem. Existing solutions for exact set similarity join follow a filtering-verification framework, which generates a list of candidate pairs by scanning indexes in the filtering phase and reports the truly similar pairs in the verification phase. Although much research has been conducted on this problem, set relations have not been well studied as a means of improving algorithm efficiency through computational cost sharing. We therefore explore set relations at different levels to reduce the overall computational cost. First, it has been shown that most of the computational time is spent in the filtering phase, which for existing solutions can be quadratic in the number of sets in the worst case. We thus explore index-level set relations to reduce the filtering cost while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm, based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm that incrementally generates the answer of one set from the already computed answer of another similar set, rather than computing it from scratch. In addition, since data in real applications are usually updated dynamically, we extend our techniques and design efficient algorithms that incrementally update the join result when any element in the sets is updated. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets.
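The filtering-verification framework mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a generic prefix-filter self-join under Jaccard similarity, a standard baseline technique swapped in here for illustration. The function names `jaccard` and `similarity_self_join`, the choice of Jaccard as the similarity measure, and the restriction to non-empty sets with a threshold in (0, 1] are all assumptions of this sketch.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similarity_self_join(sets, threshold):
    """Return sorted index pairs (i, j), i < j, with Jaccard >= threshold.

    Illustrative prefix-filter baseline (an assumption of this sketch,
    not the method of the paper). Expects non-empty sets and a
    threshold in (0, 1].
    """
    # Global token order: rarest tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of sets with that token in their prefix
    results = []
    for i, toks in enumerate(ordered):
        # Two sets with Jaccard >= threshold must share a token in these prefixes.
        # The small epsilon guards against floating-point round-up in ceil().
        prefix_len = len(toks) - ceil(threshold * len(toks) - 1e-9) + 1
        prefix = toks[:prefix_len]

        # Filtering phase: probe the inverted index to collect candidates.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification phase: compute the exact similarity of each candidate.
        for j in candidates:
            if jaccard(sets[i], sets[j]) >= threshold:
                results.append((j, i))

        # Index the current set so that later sets can find it.
        for tok in prefix:
            index[tok].append(i)
    return sorted(results)
```

Calling `similarity_self_join([{1, 2, 3, 4}, {1, 2, 3, 5}, {6, 7, 8}], 0.6)` would report the pair `(0, 1)` only. In this baseline every set probes the index independently, which is the per-set filtering cost that the abstract proposes to share across related sets.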
Pages: 267-292
Page count: 25