An Empirical Evaluation of Set Similarity Join Techniques

Cited by: 0

Authors:

Mann, Willi [1]

Augsten, Nikolaus [1]

Bouros, Panagiotis [2]

Affiliations:

[1] Salzburg Univ, Salzburg, Austria

[2] Aarhus Univ, Aarhus, Denmark

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2016, Vol. 9, No. 9

Funding:

Austrian Science Fund;

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous work. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm to adopt this technique, is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.
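
The filter-verification approach with a prefix filter, as named in the abstract, can be illustrated with a minimal Python sketch. It is not the authors' implementation or any of the seven evaluated algorithms: it assumes a self-join under Jaccard similarity, the function name `prefix_filter_self_join`, the helper `min_overlap`, and the tolerance `EPS` are hypothetical, and verification is a plain set intersection rather than the early-terminating verification the abstract discusses.

```python
from collections import defaultdict
from math import ceil

EPS = 1e-9  # tolerance against floating-point rounding (an assumption of this sketch)


def min_overlap(threshold, size):
    """Smallest overlap a set of this size needs to reach the Jaccard threshold."""
    return ceil(threshold * size - EPS)


def prefix_filter_self_join(sets, threshold):
    """Return all index pairs (i, j), i < j, with Jaccard similarity >= threshold."""
    # Global token order: rare tokens first, so prefixes hold selective tokens.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    records = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    # Probe in order of increasing size, so indexed sets are never larger than the probe.
    order = sorted(range(len(records)), key=lambda i: len(records[i]))
    index = defaultdict(list)  # token -> ids of sets holding that token in their prefix
    result = []
    for i in order:
        rec = records[i]
        if not rec:
            continue
        # Filter step: two sets can only be similar if their prefixes share a token.
        prefix_len = len(rec) - min_overlap(threshold, len(rec)) + 1
        candidates = set()
        for tok in rec[:prefix_len]:
            for j in index[tok]:
                # Length filter: a much smaller set cannot reach the threshold.
                if len(records[j]) >= threshold * len(rec) - EPS:
                    candidates.add(j)
            index[tok].append(i)
        # Verification step: compute the exact similarity of each candidate pair.
        r = set(rec)
        for j in candidates:
            s = set(records[j])
            if len(r & s) >= threshold * len(r | s) - EPS:
                result.append((min(i, j), max(i, j)))
    return sorted(result)


example = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"},
           {"x", "y", "z"}, {"a", "b", "c", "d", "e"}]
pairs = prefix_filter_self_join(example, 0.6)
```

In this sketch the third set shares no prefix token with any other set, so it is never passed to verification; only pairs that survive the filter pay the cost of an exact comparison.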

Pages: 636-647

Page count: 12

50 entries in total

[21] Accelerating Progressive Set Similarity Join with the CPU-GPU Architecture
Yu, Lining
Nie, Tiezheng
Shen, Derong
Kou, Yue
BIG DATA RESEARCH, 2021, 26
[22] Power-Law Based Estimation of Set Similarity Join Size
Lee, Hongrae
Ng, Raymond T.
Shim, Kyuseok
PROCEEDINGS OF THE VLDB ENDOWMENT, 2009, 2 (01): 658-669
[23] HySet: A hybrid framework for exact set similarity join using a GPU
Bellas, Christos
Gounaris, Anastasios
PARALLEL COMPUTING, 2021, 104
[24] A Set Similarity Self-Join Algorithm with FP-tree and MapReduce
Feng Y.
Wu K.
Huang Z.
Feng Y.
Chen H.
Bai J.
Ming Z.
Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (12): 2890-2906
[25] TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching
Zeakis, Alexandros
Skoutas, Dimitrios
Sacharidis, Dimitris
Papapetrou, Odysseas
Koubarakis, Manolis
PROCEEDINGS OF THE VLDB ENDOWMENT, 2022, 16 (04): 790-802
[26] How improve Set Similarity Join based on prefix approach in distributed environment
Zhu, Song
Gagliardelli, Luca
Simonini, Giovanni
Beneventano, Domenico
PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2018: 844-851
[27] A near-optimal similarity join algorithm and performance evaluation
Yang, ZW
Yang, GQ
INFORMATION SCIENCES, 2004, 167 (1-4): 87-108
[28] Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment
Rafiei, Davood
Deng, Fan
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (04): 768-781
[29] An Empirical Study on Document Similarity Comparison Evaluation Between Machine Learning Techniques and Human Experts
Jang, Won-Jung
TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2024, 31 (05): 1668-1679
[30] Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
Lu, Jiaheng
Lin, Chunbin
Wang, Jin
Li, Chen
PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019: 2975-2976