Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment

Cited by: 27
Authors
Fernandez, Raul Castro [1]
Min, Jisoo [1]
Nava, Demitri [1]
Madden, Samuel [1]
Affiliations
[1] MIT, CSAIL, Cambridge, MA 02139 USA
DOI
10.1109/ICDE.2019.00109
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n^2) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to computing containment, and it only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original Jaccard similarity (JS) and Jaccard containment (JC) estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query instead of only returning the candidate set; ii) it permits querying for a specific threshold on-the-fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.
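The idea of coupling similarity and containment through set cardinalities can be illustrated with a short sketch. It is not the LAZO implementation described in the paper; it only uses the standard identity |A ∩ B| = JS · (|A| + |B|) / (1 + JS), from which containment follows as |A ∩ B| / |A|. The function names `minhash_signature` and `estimate_js_jc`, the choice of 128 hash functions, and the clamping step are assumptions made for illustration, and the cardinalities are assumed to be known exactly.

```python
import hashlib


def minhash_signature(values, num_perm=128):
    """Return a MinHash signature for a non-empty collection of values.

    Each of the num_perm salted hash functions contributes the minimum
    64-bit hash it produces over all values (hypothetical helper).
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(str(v).encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for v in values
        ))
    return signature


def estimate_js_jc(sig_a, card_a, sig_b, card_b):
    """Estimate (JS, containment of A in B, containment of B in A).

    Assumes both signatures were built with the same hash functions and
    that card_a and card_b are the (non-zero) set cardinalities.
    """
    # Plain MinHash estimate: fraction of signature positions that agree.
    js = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    # Intersection size implied by JS and the two cardinalities.
    inter = js * (card_a + card_b) / (1 + js)
    # Assumption: clamp, since the intersection cannot exceed either set.
    inter = min(inter, card_a, card_b)
    union = card_a + card_b - inter
    return inter / union, inter / card_a, inter / card_b


if __name__ == "__main__":
    small = set(range(100))
    big = set(range(1000))
    js, jc_small_in_big, jc_big_in_small = estimate_js_jc(
        minhash_signature(small), len(small),
        minhash_signature(big), len(big),
    )
    print(js, jc_small_in_big, jc_big_in_small)
```

Here the true values are JS = 0.1, a containment of 1.0 for the small set in the large one, and 0.1 in the other direction; the printed estimates are approximate and vary with the number of hash functions.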
Pages: 1190-1201
Number of pages: 12