Dynamic Set Similarity Join: An Update Log Based Approach

Cited by: 3
Authors
Yang, Chengcheng [1]
Chen, Lisi [2]
Wang, Hao [3]
Shang, Shuo [2]
Mao, Rui [4]
Zhang, Xiangliang [5]
Affiliations
[1] East China Normal Univ, Shanghai Engn Res Ctr Big Data Management, Sch Data Sci & Engn, Shanghai 200241, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China
[4] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[5] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
Keywords
Indexes; Heuristic algorithms; Costs; Computer science; Computational efficiency; Social networking (online); Data mining; Dynamic set similarity join; Log filter; Adaptive method
DOI
10.1109/TKDE.2021.3126631
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The set similarity join finds all pairs of similar sets from two collections of sets. It has many real-world applications, such as personalized recommendation and community mining. In this paper, we study the problem of computing the similarity join in a dynamic context, where the sets are updated over time. State-of-the-art join methods are inefficient in this setting, because they usually assume that the data collections are static and must recompute the join result from scratch whenever a set is updated. To address this issue, we propose ALJoin, an adaptive filtering approach that computes the join result incrementally based on the update logs. We first investigate the effect of set updates on the similarity values, and on this basis we propose to build a neighborhood index for each set. The neighborhood index of a specific set consists of all other sets that can be transformed into one of its similar sets within a threshold number of update operations. ALJoin then uses this index to effectively identify both similar and dissimilar set pairs based on their update logs. To build the neighborhood index efficiently, we devise several filtering techniques and propose a "lazy-forward" method to reduce the computational cost. In addition, to improve efficiency on varying workloads, we propose an analytical cost model and design an online algorithm with performance guarantees that dynamically consolidates the update logs and adapts the neighborhood indexes. We evaluated our method using four real-world datasets. Experimental results show that our approach outperforms existing methods by up to 3.7×.
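The static-versus-dynamic contrast in the abstract can be illustrated with a minimal Python sketch. It is not ALJoin and uses no neighborhood index or update log. It only shows a naive join and a simple refresh that re-checks the pairs involving one updated set. The Jaccard measure, the threshold value, and the function names `jaccard`, `similarity_join`, and `refresh_after_update` are assumptions made for illustration; the abstract does not specify them.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (assumed measure, not named in the abstract)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_join(R, S, threshold):
    """Naive from-scratch join: all index pairs (i, j) whose similarity meets the threshold."""
    return {
        (i, j)
        for (i, r), (j, s) in product(enumerate(R), enumerate(S))
        if jaccard(r, s) >= threshold
    }


def refresh_after_update(result, R, S, i, threshold):
    """Hypothetical baseline refresh: re-check only the pairs that involve the updated set R[i]."""
    kept = {(a, b) for (a, b) in result if a != i}
    fresh = {(i, j) for j, s in enumerate(S) if jaccard(R[i], s) >= threshold}
    return kept | fresh


R = [{1, 2, 3}, {4, 5}]
S = [{1, 2, 3, 4}, {4, 5, 6}]
result = similarity_join(R, S, 0.7)                   # only (0, 0) qualifies
R[1].add(6)                                           # one update operation on R[1]
result = refresh_after_update(result, R, S, 1, 0.7)   # (1, 1) now qualifies as well
```

Even this simple refresh compares the updated set against every set in the other collection. The approach described in the abstract aims to avoid that work by filtering candidate pairs with the update logs.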
Pages: 3727-3741
Page count: 15
Related papers
50 in total
  • [41] String similarity join with different similarity thresholds based on novel indexing techniques
    Chuitian Rong
    Yasin N. Silva
    Chunqing Li
    Frontiers of Computer Science, 2017, 11 : 307 - 319
  • [42] String similarity join with different similarity thresholds based on novel indexing techniques
    Rong, Chuitian
    Silva, Yasin N.
    Li, Chunqing
    FRONTIERS OF COMPUTER SCIENCE, 2017, 11 (02) : 307 - 319
  • [43] A Scalable Similarity Join Algorithm Based on MapReduce and LSH
    Sébastien Rivault
    Mostafa Bamha
    Sébastien Limet
    Sophie Robert
    International Journal of Parallel Programming, 2022, 50 : 360 - 380
  • [44] A novel spectral similarity measure approach based on set operations and spectral polygon
    Du, PJ
    Chen, YH
    IGARSS 2005: IEEE International Geoscience and Remote Sensing Symposium, Vols 1-8, Proceedings, 2005, : 4319 - 4322
  • [45] A novel approach for high-dimensional vector similarity join query
    Ma, Youzhong
    Jia, Shijie
    Zhang, Yongxin
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (05):
  • [46] Fast-join: An efficient method for fuzzy token matching based string similarity join
    Wang, Jiannan
    Li, Guoliang
    Feng, Jianhua
    Proceedings - International Conference on Data Engineering, 2011, : 458 - 469
  • [47] Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join
    Wang, Jiannan
    Li, Guoliang
    Feng, Jianhua
    IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011), 2011, : 458 - 469
  • [48] Similarity computation between fuzzy set and crisp set with similarity measure based on distance
    Lee, Sang H.
    Park, Hyunjeong
    Park, Wook Je
    INFORMATION RETRIEVAL TECHNOLOGY, 2008, 4993 : 644 - +
  • [49] Generalized dynamic attribute reduction based on similarity relation of intuitionistic fuzzy rough set
    Zhang Chuanchao
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 39 (05) : 7107 - 7122
  • [50] Set similarity modulates object tracking in dynamic environments
    Sibel Akyuz
    Jaap Munneke
    Jennifer E. Corbett
    Attention, Perception, & Psychophysics, 2018, 80 : 1744 - 1751