Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment

Cited by: 27
Authors
Fernandez, Raul Castro [1]
Min, Jisoo [1]
Nava, Demitri [1]
Madden, Samuel [1]
Affiliation
[1] MIT, CSAIL, Cambridge, MA 02139 USA
DOI
10.1109/ICDE.2019.00109
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Computing such relationships exactly is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n^2) operation. Fortunately, approximate solutions can be obtained much faster using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to computing containment, and it only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment scores. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity that takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original similarity (JS) and containment (JC) estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query, instead of only returning the candidate set; ii) it permits querying for a specific threshold on the fly, as opposed to LSH indexes, which must be configured with a threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.
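The relationship between Jaccard similarity, set cardinalities, and containment that the abstract alludes to can be illustrated with a short sketch. This is not the authors' implementation: it only uses the standard identity |A ∩ B| = JS · (|A| + |B|) / (1 + JS), and the function names `exact_scores` and `containment_from_js` are hypothetical, chosen for illustration. In practice the JS input would be an estimate (for example from a MinHash sketch) rather than an exact value.

```python
def exact_scores(a: set, b: set):
    """Exact Jaccard similarity and both containment scores for two sets."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    js = inter / union if union else 0.0
    jc_ab = inter / len(a) if a else 0.0  # fraction of A that is in B
    jc_ba = inter / len(b) if b else 0.0  # fraction of B that is in A
    return js, jc_ab, jc_ba


def containment_from_js(js: float, card_a: int, card_b: int):
    """Derive both containment scores from a JS value and the two cardinalities.

    Uses JS = I / (|A| + |B| - I), solved for the intersection size I.
    Illustrative only; assumes the cardinalities are known or estimated elsewhere.
    """
    inter = js * (card_a + card_b) / (1.0 + js)
    # The intersection cannot exceed either set's size; clamp noisy estimates.
    inter = min(inter, card_a, card_b)
    jc_ab = inter / card_a if card_a else 0.0
    jc_ba = inter / card_b if card_b else 0.0
    return jc_ab, jc_ba


if __name__ == "__main__":
    a = {1, 2, 3, 4}
    b = {3, 4, 5, 6, 7, 8}
    js, jc_ab, jc_ba = exact_scores(a, b)
    print(js, jc_ab, jc_ba)
    print(containment_from_js(js, len(a), len(b)))
```

With an exact JS the derived containment scores match the exact ones; with an estimated JS the clamp keeps the derived intersection within the bounds the cardinalities allow.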
Pages: 1190-1201
Page count: 12