An Empirical Evaluation of Set Similarity Join Techniques

Cited by: 0
Authors
Mann, Willi [1]
Augsten, Nikolaus [1]
Bouros, Panagiotis [2]
Affiliations
[1] Salzburg Univ, Salzburg, Austria
[2] Aarhus Univ, Aarhus, Denmark
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2016 / Vol. 9 / No. 09
Funding
Austrian Science Fund;
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous work. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm adopting this technique, is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.
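The filter-verification approach and the prefix filter named in the abstract can be illustrated with a minimal Python sketch of a self-join under a Jaccard threshold. This is not the authors' implementation or any of the seven evaluated algorithms. The function names `jaccard` and `prefix_filter_self_join`, the choice of Jaccard similarity, the frequency-based token order, and the naive verification step (which reads whole sets rather than a few elements) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def jaccard(r, s):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(r & s) / len(r | s)


def prefix_filter_self_join(sets, threshold):
    """Return index pairs (i, j), i < j, with Jaccard similarity >= threshold.

    Illustrative sketch: `sets` is a list of Python sets with mutually
    comparable tokens, and `threshold` is assumed to lie in (0, 1].
    """
    # Global token order: rare tokens first, ties broken by token value.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    records = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]

    index = defaultdict(list)  # token -> ids of sets whose prefix contains it
    result = []
    for i, rec in enumerate(records):
        if not rec:
            continue  # empty sets are skipped in this sketch
        # Prefix length |r| - ceil(t * |r|) + 1; the epsilon guards against
        # floating-point error pushing the ceiling one step too high.
        min_overlap = math.ceil(threshold * len(rec) - 1e-9)
        prefix = rec[: len(rec) - min_overlap + 1]

        # Filter: a candidate must share at least one prefix token with rec.
        candidates = set()
        for tok in prefix:
            candidates.update(index.get(tok, ()))

        # Verification: compute the exact similarity for each candidate.
        for j in sorted(candidates):
            if jaccard(sets[i], sets[j]) >= threshold:
                result.append((j, i))

        # Index the prefix so that later sets can find this one.
        for tok in prefix:
            index[tok].append(i)
    return result
```

For example, `prefix_filter_self_join([{1, 2, 3, 4}, {1, 2, 3, 5}, {6, 7, 8}], 0.5)` only verifies the first two sets against each other, because the third set shares no prefix token with them.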
Pages: 636 - 647
Page count: 12
Related papers
50 in total
  • [21] Accelerating Progressive Set Similarity Join with the CPU-GPU Architecture
    Yu, Lining
    Nie, Tiezheng
    Shen, Derong
    Kou, Yue
    BIG DATA RESEARCH, 2021, 26
  • [22] Power-Law Based Estimation of Set Similarity Join Size
    Lee, Hongrae
    Ng, Raymond T.
    Shim, Kyuseok
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2009, 2 (01): 658 - 669
  • [23] HySet: A hybrid framework for exact set similarity join using a GPU
    Bellas, Christos
    Gounaris, Anastasios
    PARALLEL COMPUTING, 2021, 104
  • [24] A Set Similarity Self-Join Algorithm with FP-tree and MapReduce
    Feng Y.
    Wu K.
    Huang Z.
    Feng Y.
    Chen H.
    Bai J.
    Ming Z.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (12): 2890 - 2906
  • [25] TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching
    Zeakis, Alexandros
    Skoutas, Dimitrios
    Sacharidis, Dimitris
    Papapetrou, Odysseas
    Koubarakis, Manolis
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2022, 16 (04): 790 - 802
  • [26] How improve Set Similarity Join based on prefix approach in distributed environment
    Zhu, Song
    Gagliardelli, Luca
    Simonini, Giovanni
    Beneventano, Domenico
    PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2018: 844 - 851
  • [27] A near-optimal similarity join algorithm and performance evaluation
    Yang, ZW
    Yang, GQ
    INFORMATION SCIENCES, 2004, 167 (1-4): 87 - 108
  • [28] Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment
    Rafiei, Davood
    Deng, Fan
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (04): 768 - 781
  • [29] An Empirical Study on Document Similarity Comparison Evaluation Between Machine Learning Techniques and Human Experts
    Jang, Won-Jung
    TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2024, 31 (05): 1668 - 1679
  • [30] Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
    Lu, Jiaheng
    Lin, Chunbin
    Wang, Jin
    Li, Chen
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019: 2975 - 2976