Dynamic Set Similarity Join: An Update Log Based Approach

Cited by: 3

Authors:

Yang, Chengcheng [1]

Chen, Lisi [2]

Wang, Hao [3]

Shang, Shuo [2]

Mao, Rui [4]

Zhang, Xiangliang [5]

Affiliations:

[1] East China Normal Univ, Shanghai Engn Res Ctr Big Data Management, Sch Data Sci & Engn, Shanghai 200241, Peoples R China

[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China

[3] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China

[4] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China

[5] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2023, Vol. 35, No. 4

Keywords:

Indexes; Heuristic algorithms; Costs; Computer science; Computational efficiency; Social networking (online); Data mining; Dynamic set similarity join; log filter; adaptive method;

DOI:

10.1109/TKDE.2021.3126631

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The set similarity join finds all pairs of similar sets from two collections of sets. It has many real world applications, such as personalized recommendation and community mining. In this paper, we study the problem of computing the similarity join in a dynamic context, where the sets are updated dynamically. This, however, is inefficient with the state-of-the-art join methods, because they usually assume that data collections are static and have to compute the join result from scratch whenever a set is updated. To address this issue, we propose ALJoin, an adaptive filtering approach that computes the join result incrementally based on the update logs. We first investigate the effect of set updates on the similarity values, and on this basis we propose to build a neighborhood index for each set. The neighborhood index of a specific set consists of any other sets that can be transformed into its similar sets within a threshold number of update operations. ALJoin then uses this index to effectively identify both similar and dissimilar set pairs based on their update logs. To efficiently build the neighborhood index, we devise several filtering techniques and propose a "lazy-forward" method to reduce the computational cost. In addition, to improve the efficiency on varying workloads, we propose an analytical cost model, and design an online algorithm with performance guarantees to dynamically consolidate the update logs and adapt the neighborhood indexes. We evaluated our method using four real-world datasets. Experimental results show that our approach outperforms existing methods by up to 3.7×.
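
To make the problem setting concrete, the following is a minimal sketch of the static baseline that the abstract describes as inefficient: a join that is recomputed from scratch after every update. It is not ALJoin and does not use update logs or a neighborhood index. The Jaccard similarity measure, the threshold value 0.5, and all names (`jaccard`, `naive_join`, `R`, `S`) are assumptions chosen for illustration; the abstract does not specify them.

```python
from itertools import product


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure; 1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def naive_join(R: dict, S: dict, threshold: float) -> set:
    """Compare every pair across the two collections and keep the similar ones."""
    return {
        (r_id, s_id)
        for (r_id, r_set), (s_id, s_set) in product(R.items(), S.items())
        if jaccard(r_set, s_set) >= threshold
    }


# Two small hypothetical collections of sets.
R = {"r1": {1, 2, 3, 4}, "r2": {7, 8}}
S = {"s1": {1, 2, 3, 5}, "s2": {7, 8, 9}}

before = naive_join(R, S, threshold=0.5)

# Dynamic updates: one element is inserted into s1, one is deleted from r2.
S["s1"].add(4)       # similarity(r1, s1) rises from 3/5 to 4/5
R["r2"].discard(8)   # similarity(r2, s2) drops from 2/3 to 1/3

# The static baseline has to re-examine all pairs after the updates.
after = naive_join(R, S, threshold=0.5)

print(sorted(before))
print(sorted(after))
```

In this sketch a single deletion moves the pair (r2, s2) out of the result, yet the baseline still re-evaluates every pair. That repeated full comparison is the cost an incremental, log-based approach aims to avoid.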

Pages: 3727-3741

Number of pages: 15
