Dynamic Set Similarity Join: An Update Log Based Approach

Cited by: 3
Authors
Yang, Chengcheng [1]
Chen, Lisi [2]
Wang, Hao [3]
Shang, Shuo [2]
Mao, Rui [4]
Zhang, Xiangliang [5]
Affiliations
[1] East China Normal Univ, Shanghai Engn Res Ctr Big Data Management, Sch Data Sci & Engn, Shanghai 200241, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China
[4] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[5] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
Keywords
Indexes; Heuristic algorithms; Costs; Computer science; Computational efficiency; Social networking (online); Data mining; Dynamic set similarity join; log filter; adaptive method
DOI
10.1109/TKDE.2021.3126631
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The set similarity join finds all pairs of similar sets from two collections of sets. It has many real-world applications, such as personalized recommendation and community mining. In this paper, we study the problem of computing the similarity join in a dynamic context, where the sets are updated over time. State-of-the-art join methods are inefficient in this setting, because they usually assume that data collections are static and must recompute the join result from scratch whenever a set is updated. To address this issue, we propose ALJoin, an adaptive filtering approach that computes the join result incrementally based on the update logs. We first investigate the effect of set updates on the similarity values, and on this basis we propose to build a neighborhood index for each set. The neighborhood index of a set consists of all other sets that can be transformed into sets similar to it within a threshold number of update operations. ALJoin then uses this index to effectively identify both similar and dissimilar set pairs based on their update logs. To build the neighborhood index efficiently, we devise several filtering techniques and propose a "lazy-forward" method to reduce the computational cost. In addition, to improve efficiency on varying workloads, we propose an analytical cost model and design an online algorithm with performance guarantees that dynamically consolidates the update logs and adapts the neighborhood indexes. We evaluated our method using four real-world datasets. Experimental results show that our approach outperforms existing methods by up to 3.7×.
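To make the problem setting concrete, the following is a minimal sketch of a set similarity join and of refreshing its result after a single logged update. It is not ALJoin: it uses no neighborhood index, log filter, or cost model, and simply re-evaluates every pair that involves the updated set. The choice of Jaccard similarity, the threshold value, the `(op, element)` update format, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_join(R, S, threshold):
    """Naive static join: ids of all pairs (r, s) with jaccard(r, s) >= threshold."""
    return {
        (rid, sid)
        for (rid, r), (sid, s) in product(R.items(), S.items())
        if jaccard(r, s) >= threshold
    }


def apply_update(R, S, result, rid, op, element, threshold):
    """Apply one logged update to R[rid], then refresh only the pairs involving rid.

    Hypothetical update format: op is "insert" or "delete".
    """
    if op == "insert":
        R[rid].add(element)
    elif op == "delete":
        R[rid].discard(element)
    else:
        raise ValueError(f"unknown update operation: {op}")
    for sid, s in S.items():
        if jaccard(R[rid], s) >= threshold:
            result.add((rid, sid))
        else:
            result.discard((rid, sid))
    return result


# Toy collections (assumed data, for illustration only).
R = {"r1": {1, 2, 3}, "r2": {7, 8}}
S = {"s1": {1, 2, 3, 4}, "s2": {7, 8, 9}}
THRESHOLD = 0.7

initial = similarity_join(R, S, THRESHOLD)            # {("r1", "s1")}
result = set(initial)
apply_update(R, S, result, "r2", "insert", 9, THRESHOLD)  # r2 now equals s2
apply_update(R, S, result, "r1", "delete", 3, THRESHOLD)  # r1 drops to 0.5 vs s1
print(sorted(result))                                 # [('r2', 's2')]
```

Even this simple refresh avoids recomputing the full join, but it still compares the updated set against every set in the other collection after each update, which is the per-update cost that an index over update logs aims to reduce.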
Pages: 3727-3741
Page count: 15
Related papers
50 in total
  • [1] Leveraging set relations in exact and dynamic set similarity join
    Xubo Wang
    Lu Qin
    Xuemin Lin
    Ying Zhang
    Lijun Chang
    The VLDB Journal, 2019, 28 : 267 - 292
  • [2] Leveraging set relations in exact and dynamic set similarity join
    Wang, Xubo
    Qin, Lu
    Lin, Xuemin
    Zhang, Ying
    Chang, Lijun
    VLDB JOURNAL, 2019, 28 (02): : 267 - 292
  • [3] How improve Set Similarity Join based on prefix approach in distributed environment
    Zhu, Song
    Gagliardelli, Luca
    Simonini, Giovanni
    Beneventano, Domenico
    PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2018, : 844 - 851
  • [4] Distributed Streaming Set Similarity Join
    Yang, Jianye
    Zhang, Wenjie
    Wang, Xiang
    Zhang, Ying
    Lin, Xuemin
    2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020), 2020, : 565 - 576
  • [5] Set Similarity Join on Probabilistic Data
    Lian, Xiang
    Chen, Lei
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (01): : 650 - 659
  • [6] Scalable and Robust Set Similarity Join
    Christiani, Tobias
    Pagh, Rasmus
    Sivertsen, Johan
    2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2018, : 1240 - 1243
  • [7] Leveraging Set Relations in Exact Set Similarity Join
    Wang, Xubo
    Qin, Lu
    Lin, Xuemin
    Zhang, Ying
    Chang, Lijun
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2017, 10 (09): : 925 - 936
  • [8] Power-Law Based Estimation of Set Similarity Join Size
    Lee, Hongrae
    Ng, Raymond T.
    Shim, Kyuseok
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2009, 2 (01): : 658 - 669
  • [9] An Empirical Evaluation of Set Similarity Join Techniques
    Mann, Willi
    Augsten, Nikolaus
    Bouros, Panagiotis
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2016, 9 (09): : 636 - 647
  • [10] Fuzzy Similarity Join Algorithm Based on Dynamic Double Prefixes
    Yu C.-Y.
    Wang W.-H.
    Wen X.-J.
    Zhao Y.-H.
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2022, 43 (03): : 321 - 327