LES3: Learning-based Exact Set Similarity Search

Cited by: 3
Authors
Li, Yifan [1]
Yu, Xiaohui [1]
Koudas, Nick [2]
Affiliations
[1] York Univ, Toronto, ON, Canada
[2] Univ Toronto, Toronto, ON, Canada
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2021 / Vol. 14 / No. 11
Keywords
FRAMEWORK; JOINS;
DOI
10.14778/3476249.3476263
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches to set similarity search utilize either heavy indexing structures, which incur large search costs, or indexes that produce large candidate sets. In this paper, we design a learning-based exact set similarity search approach, LES3. Our approach first partitions sets into groups, and then utilizes a light-weight bitmap-like indexing structure, called token-group matrix (TGM), to organize groups and prune out candidates given a query set. In order to optimize pruning using the TGM, we analytically investigate the optimal partitioning strategy under certain distributional assumptions. Using these results, we then design a learning-based partitioning approach called L2P and an associated data representation encoding, PTR, to identify the partitions. We conduct extensive experiments on real and synthetic datasets to fully study LES3, establishing its effectiveness and superiority over other applicable approaches.
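The group-level pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Jaccard similarity, a threshold greater than 0, and a partitioning that is already given (the learned L2P partitioning and the PTR encoding are not modeled). The names `build_tgm` and `search` are hypothetical, and the token-group matrix is stored as a token-to-group-ids map rather than a packed bitmap.

```python
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_tgm(groups):
    """Assumed TGM layout: token -> ids of groups in which the token occurs."""
    tgm = defaultdict(set)
    for gid, members in enumerate(groups):
        for s in members:
            for token in s:
                tgm[token].add(gid)
    return tgm


def search(query, groups, tgm, threshold):
    """Return every set whose Jaccard similarity to query is >= threshold.

    Assumes threshold > 0, so groups sharing no token with the query
    can be skipped outright.
    """
    # Count, per group, how many query tokens occur anywhere in the group.
    hits = defaultdict(int)
    for token in query:
        for gid in tgm.get(token, ()):
            hits[gid] += 1

    results = []
    for gid, count in hits.items():
        # For any set S in the group, |Q & S| <= count and |Q | S| >= |Q|,
        # so count / |Q| is an upper bound on Jaccard(Q, S).
        if count / len(query) < threshold:
            continue  # the whole group is pruned without touching its sets
        # Exact verification keeps the search exact.
        results.extend(s for s in groups[gid] if jaccard(query, s) >= threshold)
    return results


# Toy data: two hand-made groups (a real system would choose the partitioning).
groups = [
    [{"a", "b", "c"}, {"a", "b", "d"}],
    [{"x", "y", "z"}, {"c", "x", "y"}],
]
tgm = build_tgm(groups)
matches = search({"a", "b", "c"}, groups, tgm, threshold=0.5)
```

In this toy run the second group shares only one of the three query tokens, so its bound of 1/3 falls below the threshold and none of its sets are compared. How well such a bound prunes depends on how the sets are grouped, which is the question the paper's partitioning analysis addresses.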
Pages: 2073 - 2086
Page count: 14
Related papers
50 in total
  • [1] A Learning-Based Approach for Multi-scenario Trajectory Similarity Search
    Feng, Chunhui
    Pan, Zhicheng
    Fang, Junhua
    Chao, Pingfu
    Liu, An
    Zhao, Lei
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2022, 2022, 13724 : 478 - 492
  • [2] Iterative machine learning-based chemical similarity search to identify novel chemical inhibitors
    Durai, Prasannavenkatesh
    Lee, Sue Jung
    Lee, Jae Wook
    Pan, Cheol-Ho
    Park, Keunwan
    JOURNAL OF CHEMINFORMATICS, 2023, 15 (01)
  • [3] Iterative machine learning-based chemical similarity search to identify novel chemical inhibitors
    Prasannavenkatesh Durai
    Sue Jung Lee
    Jae Wook Lee
    Cheol-Ho Pan
    Keunwan Park
    Journal of Cheminformatics, 15
  • [4] An Efficient Framework for Exact Set Similarity Search using Tree Structure Indexes
    Zhang, Yong
    Li, Xiuxing
    Wang, Jin
    Zhang, Ying
    Xing, Chunxiao
    Yuan, Xiaojie
    2017 IEEE 33RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2017), 2017, : 759 - 770
  • [5] An Efficient Partition Based Method for Exact Set Similarity Joins
    Deng, Dong
    Li, Guoliang
    Wen, He
    Feng, Jianhua
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2015, 9 (04): : 360 - 371
  • [6] Set-based Similarity Search for Time Series
    Peng, Jinglin
    Wang, Hongzhi
    Li, Jianzhong
    Gao, Hong
    SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2016, : 2039 - 2052
  • [7] Learning-based similarity measurement for fuzzy sets
    Tocatlidou, A
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 1998, 13 (2-3) : 193 - 220
  • [8] On Projection Based Operators in lp Space for Exact Similarity Search
    Wichert, Andreas
    Moreira, Catarina
    FUNDAMENTA INFORMATICAE, 2015, 136 (04) : 461 - 474
  • [9] Learning-Based Efficient Graph Similarity Computation via Multi-Scale Convolutional Set Matching
    Bai, Yunsheng
    Ding, Hao
    Gu, Ken
    Sun, Yizhou
    Wang, Wei
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 3219 - 3226
  • [10] A Transformation-Based Framework for KNN Set Similarity Search
    Zhang, Yong
    Wu, Jiacheng
    Wang, Jin
    Xing, Chunxiao
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (03) : 409 - 423