SETJoin: a novel top-k similarity join algorithm

Authors
Hongya Wang
Lihong Yang
Yingyuan Xiao
Affiliations
[1] Donghua University, School of Computer Science and Technology
[2] Tianjin University of Technology, School of Computer Science and Technology
Source
Soft Computing | 2020 / Vol. 24
Keywords
Set similarity join; Query processing; Candidate filtering;
DOI
Not available
Abstract
As an important operation in data cleaning, near-duplicate Web page detection, and data mining, similarity joins have received much attention recently. Existing similarity joins fall into two broad categories: the similarity-threshold-based similarity join and the top-k similarity join (TopkJoin). Compared with the threshold-based join, TopkJoin is more suitable for cases where the similarity threshold is not known beforehand. In this paper, we focus on the performance optimization of TopkJoin. In particular, we observe that the state-of-the-art TopkJoin algorithm has three serious performance issues: inappropriate use of the hash table, inefficient use of suffix filtering, and unnecessary evaluation of excessive unqualified candidates. To resolve these problems, we propose a novel algorithm, SETJoin, which combines the existing event-driven framework with three simple yet efficient optimization techniques: (1) reducing the cost of hashing by rearranging the order of the candidate filtering and hash table lookup operations; (2) maximizing the pruning capability of suffix filtering by judiciously choosing the (near) optimal recursion depth; and (3) terminating join operations earlier by setting a much tighter stop condition for the iteration. The experimental results show that SETJoin achieves speedups of up to 1.26x–3.49x over the state-of-the-art algorithm on several real datasets.
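To make the problem statement concrete, the following is a minimal brute-force sketch of a top-k set similarity join. It is not SETJoin and uses none of the optimizations listed in the abstract (no event-driven framework, no suffix filtering, no early termination); it only illustrates what a TopkJoin is expected to return. The choice of Jaccard similarity and the names `jaccard`, `topk_join_bruteforce`, and `records` are assumptions made for illustration, since the abstract does not name a similarity function.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def topk_join_bruteforce(records, k):
    """Return the k most similar record pairs as (similarity, i, j), best first.

    Hypothetical baseline: compares every pair, so it runs in O(n^2)
    similarity computations. No similarity threshold is needed up front.
    """
    heap = []  # min-heap holding the best k pairs seen so far
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        sim = jaccard(a, b)
        if len(heap) < k:
            heapq.heappush(heap, (sim, i, j))
        elif sim > heap[0][0]:
            # The new pair beats the current k-th best, so replace it.
            heapq.heapreplace(heap, (sim, i, j))
    return sorted(heap, key=lambda t: (-t[0], t[1], t[2]))


# Toy input: each record is a set of tokens.
records = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"x", "y"},
    {"x", "y", "z"},
    {"a", "x"},
]
top2 = topk_join_bruteforce(records, 2)
```

The smallest similarity kept in the heap plays the role of a threshold that rises as better pairs are found. Practical top-k join algorithms use such a rising bound to prune candidates and stop early rather than enumerating every pair as this sketch does.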
Pages: 14577-14592
Number of pages: 15