JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes

Cited by: 79
Authors
Zhu, Erkang [1]
Deng, Dong [2, 3]
Nargesian, Fatemeh [1]
Miller, Renee J. [4]
Affiliations
[1] Univ Toronto, Toronto, ON, Canada
[2] Rutgers State Univ, Piscataway, NJ USA
[3] Incept Inst Artificial Intelligence, Abu Dhabi, U Arab Emirates
[4] Northeastern Univ, Boston, MA 02115 USA
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
MAPREDUCE; JOINS; FRAMEWORK;
DOI
10.1145/3299869.3300065
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as the intersection between sets. Although set similarity search is well studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are designed for and evaluated over sets of relatively small size (average set size rarely much over 100 and maximum set size in the low thousands) with modest dictionary sizes (the total number of distinct values in all sets is only a few million). We observe that modern data lakes typically have massive set sizes (with maximum set sizes that may be tens of millions) and dictionaries that include hundreds of millions of distinct values. Our new algorithm, JOSIE (JOining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets. We show that JOSIE completely outperforms the state-of-the-art overlap set similarity search techniques on data lakes. More surprisingly, we also consider a state-of-the-art approximate algorithm and show that our new exact search algorithm performs almost as well, and in some cases even better, on real data lakes.
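The problem formulation in the abstract (columns as sets, join quality as intersection size, return the top-k sets) can be illustrated with a minimal Python sketch. This is a naive exact baseline over an in-memory inverted index, not the JOSIE algorithm itself: it probes every query value and does no cost-based choice between set reads and index probes. The function names, column ids, and sample data are all hypothetical.

```python
import heapq
from collections import defaultdict


def build_inverted_index(columns):
    """Map each distinct value to the set of column ids that contain it."""
    index = defaultdict(set)
    for col_id, values in columns.items():
        for value in set(values):  # treat each column as a set of distinct values
            index[value].add(col_id)
    return index


def top_k_overlap(query_values, index, k):
    """Return the k (column id, overlap) pairs with the largest overlap.

    Naive exact search: probe the index once per distinct query value and
    count how many query values each candidate column shares.
    """
    overlap = defaultdict(int)
    for value in set(query_values):
        for col_id in index.get(value, ()):
            overlap[col_id] += 1
    # Largest overlap first; ties broken by column id for determinism.
    return heapq.nsmallest(k, overlap.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical data lake: column id -> cell values (duplicates allowed).
columns = {
    "cities.name": ["Toronto", "Boston", "Abu Dhabi", "Piscataway"],
    "offices.city": ["Toronto", "Boston", "Toronto", "Paris"],
    "airports.city": ["Boston", "Paris", "Berlin"],
    "products.sku": ["A1", "B2"],
}

index = build_inverted_index(columns)
result = top_k_overlap(["Toronto", "Boston", "Paris", "Toronto"], index, k=2)
print(result)  # [('offices.city', 3), ('airports.city', 2)]
```

This baseline reads every posting list touched by the query, which is the kind of cost that becomes prohibitive at the set and dictionary sizes the abstract describes.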
Pages: 847-864
Number of pages: 18
Related papers
7 in total
  • [1] LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes
    Deng, Yuhao
    Chai, Chengliang
    Cao, Lei
    Yuan, Qin
    Chen, Siyuan
    Yu, Yanrui
    Sun, Zhaoze
    Wang, Junyi
    Li, Jiajun
    Cao, Ziqi
    Jin, Kaisen
    Zhang, Chi
    Jiang, Yuqing
    Zhang, Yuanfang
    Wang, Yuping
    Yuan, Ye
    Wang, Guoren
    Tang, Nan
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17(8): 1925-1938
  • [2] Finding Related Tables in Data Lakes for Interactive Data Science
    Zhang, Yi
    Ives, Zachary G.
    SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2020: 1951-1966
  • [3] Set Similarity Search for Skewed Data
    McCauley, Samuel
    Mikkelsen, Jesper W.
    Pagh, Rasmus
    PODS'18: PROCEEDINGS OF THE 37TH ACM SIGMOD-SIGACT-SIGAI SYMPOSIUM ON PRINCIPLES OF DATABASE SYSTEMS, 2018: 63-74
  • [4] Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach
    Dong, Yuyang
    Takeoka, Kunihiro
    Xiao, Chuan
    Oyamada, Masafumi
    2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021), 2021: 456-467
  • [5] Efficient processing of similarity search on uncertain set-valued data
    Chen, Ke
    Hong, Yin-Jie
    Chen, Gang
    Ruan Jian Xue Bao/Journal of Software, 2012, 23(6): 1588-1601
  • [6] FAST SIMILARITY SEARCH ON A LARGE SPEECH DATA SET WITH NEIGHBORHOOD GRAPH INDEXING
    Aoyama, Kazuo
    Watanabe, Shinji
    Sawada, Hiroshi
    Minami, Yasuhiro
    Ueda, Naonori
    Saito, Kazumi
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010: 5358-5361
  • [7] Achieving Efficient and Privacy-Preserving Exact Set Similarity Search over Encrypted Data
    Zheng, Yandong
    Lu, Rongxing
    Guan, Yunguo
    Shao, Jun
    Zhu, Hui
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2022, 19(2): 1090-1103