Dynamic Set Similarity Join: An Update Log Based Approach

Cited by: 3

Authors:

Yang, Chengcheng [1]

Chen, Lisi [2]

Wang, Hao [3]

Shang, Shuo [2]

Mao, Rui [4]

Zhang, Xiangliang [5]

Affiliations:

[1] East China Normal Univ, Shanghai Engn Res Ctr Big Data Management, Sch Data Sci & Engn, Shanghai 200241, Peoples R China

[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China

[3] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China

[4] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China

[5] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2023, Vol. 35, No. 4

Keywords:

Indexes; Heuristic algorithms; Costs; Computer science; Computational efficiency; Social networking (online); Data mining; Dynamic set similarity join; log filter; adaptive method;

DOI:

10.1109/TKDE.2021.3126631

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The set similarity join finds all pairs of similar sets from two collections of sets. It has many real-world applications, such as personalized recommendation and community mining. In this paper, we study the problem of computing the similarity join in a dynamic context, where the sets are updated dynamically. State-of-the-art join methods handle this setting inefficiently, because they usually assume that data collections are static and must recompute the join result from scratch whenever a set is updated. To address this issue, we propose ALJoin, an adaptive filtering approach that computes the join result incrementally based on the update logs. We first investigate the effect of set updates on the similarity values, and on this basis we propose to build a neighborhood index for each set. The neighborhood index of a specific set consists of all other sets that can be transformed into one of its similar sets within a threshold number of update operations. ALJoin then uses this index to effectively identify both similar and dissimilar set pairs based on their update logs. To efficiently build the neighborhood index, we devise several filtering techniques and propose a "lazy-forward" method to reduce the computational cost. In addition, to improve the efficiency on varying workloads, we propose an analytical cost model, and design an online algorithm with performance guarantees to dynamically consolidate the update logs and adapt the neighborhood indexes. We evaluated our method using four real-world datasets. Experimental results show that our approach outperforms existing methods by up to 3.7×.
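To make the problem setting concrete, the following minimal Python sketch shows a brute-force set similarity join and a naive way to refresh its result after a single update. It is not ALJoin: it builds no neighborhood index and applies no log-based filtering. The use of Jaccard similarity is an assumption (the abstract does not name a similarity function), and the names `jaccard`, `naive_join`, and `apply_update` are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (assumed measure); 1.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def naive_join(R, S, threshold):
    """Brute-force join: all (rid, sid) whose similarity reaches the threshold."""
    return {
        (rid, sid)
        for (rid, r), (sid, s) in product(R.items(), S.items())
        if jaccard(r, s) >= threshold
    }


def apply_update(R, S, result, threshold, rid, op, token):
    """Apply one insert/delete to R[rid] and re-check only the pairs involving rid.

    This is a simple baseline, not the paper's method: it still compares the
    updated set against every set in S.
    """
    if op == "insert":
        R[rid].add(token)
    elif op == "delete":
        R[rid].discard(token)
    else:
        raise ValueError(f"unknown operation: {op}")
    # Drop stale pairs for the updated set, then recompute them.
    result = {(i, j) for (i, j) in result if i != rid}
    result |= {(rid, sid) for sid, s in S.items() if jaccard(R[rid], s) >= threshold}
    return result


# Hypothetical toy data.
R = {"r1": {1, 2, 3}, "r2": {7, 8}}
S = {"s1": {1, 2, 3, 4}, "s2": {8, 9}}
before = naive_join(R, S, 0.6)
after = apply_update(R, S, before, 0.6, "r2", "insert", 9)
```

In this toy run, inserting token 9 into `r2` raises its similarity to `s2` from 1/3 to 2/3, so the pair enters the result. The sketch still scans all of `S` on every update, which is the per-update cost that the abstract's neighborhood index and update-log filtering are meant to reduce.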


Pages: 3727-3741

Number of pages: 15

