Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment

Cited by: 27
Authors
Fernandez, Raul Castro [1]
Min, Jisoo [1]
Nava, Demitri [1]
Madden, Samuel [1]
Affiliation
[1] MIT, CSAIL, Cambridge, MA 02139 USA
Keywords
DOI
10.1109/ICDE.2019.00109
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Computing such relationships exactly is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n^2) operation. Fortunately, approximate solutions can be obtained much faster using locality-sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to computing containment, and it only returns results whose similarity exceeds a pre-defined threshold; we want to know the specific similarity and containment scores. The main contribution of this paper is LAZO, a method to simultaneously estimate both the Jaccard similarity (JS) and the Jaccard containment (JC) of datasets, based on a redefinition of Jaccard similarity that takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Finally, we implement LAZO as a new indexing structure with these additional properties: i) it returns numerical scores indicating the degree of similarity and containment between each candidate and the query, instead of returning only the candidate set; ii) it allows querying for a specific threshold on the fly, as opposed to LSH indexes, which must be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.
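The abstract does not spell out how similarity and containment are coupled through cardinalities, so the following is only a minimal sketch of the standard set identity that makes such a coupling possible, not the paper's actual estimator. Since JS(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|), a similarity value together with the two cardinalities determines the intersection size, and from it the containment |A ∩ B| / |A|. The function names `intersection_from_jaccard` and `containment_from_jaccard` are hypothetical and chosen for illustration.

```python
def intersection_from_jaccard(js: float, card_a: int, card_b: int) -> float:
    """Solve JS = i / (|A| + |B| - i) for the intersection size i.

    `js` may be exact or an estimate (for example, from MinHash signatures).
    """
    if not 0.0 <= js <= 1.0:
        raise ValueError("Jaccard similarity must lie in [0, 1]")
    i = js * (card_a + card_b) / (1.0 + js)
    # With a noisy estimate, i can exceed the smaller set; clamp it so the
    # derived containment never goes above 1.
    return min(i, float(card_a), float(card_b))


def containment_from_jaccard(js: float, card_a: int, card_b: int) -> float:
    """Containment of A in B, |A ∩ B| / |A|, derived from JS and cardinalities."""
    if card_a == 0:
        return 0.0
    return intersection_from_jaccard(js, card_a, card_b) / card_a


if __name__ == "__main__":
    a = {1, 2, 3, 4}
    b = {3, 4, 5, 6, 7, 8}
    js = len(a & b) / len(a | b)  # exact similarity, for illustration only
    print(containment_from_jaccard(js, len(a), len(b)))  # containment of a in b
    print(containment_from_jaccard(js, len(b), len(a)))  # containment of b in a
```

In this sketch the exact similarity is computed from the sets only to check the identity; in practice the similarity would be an estimate, which is why the intersection is clamped to the smaller cardinality.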
Pages: 1190-1201
Page count: 12
Related papers
50 in total
  • [1] A class of rational cardinality-based similarity measures
    De Baets, B
    De Meyer, H
    Naessens, H
    JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS, 2001, 132 (01) : 51 - 69
  • [3] Fuzzy Transitivity and Monotonicity of Cardinality-based Similarity Measures
    Ashraf, S.
    Husnine, S. M.
    Rashid, T.
    FUZZY INFORMATION AND ENGINEERING, 2012, 4 (02) : 145 - 153
  • [4] On the transitivity of a parametric family of cardinality-based similarity measures
    De Baets, B.
    Janssens, S.
    De Meyer, H.
    INTERNATIONAL JOURNAL OF APPROXIMATE REASONING, 2009, 50 (01) : 104 - 116
  • [5] A BIPARAMETRIC FAMILY OF CARDINALITY-BASED FUZZY SIMILARITY MEASURES
    Bosteels, Klaas
    Kerre, Etienne E.
    NEW MATHEMATICS AND NATURAL COMPUTATION, 2007, 3 (03) : 307 - 319
  • [6] A triparametric family of cardinality-based fuzzy similarity measures
    Bosteels, Klaas
    Kerre, Etienne E.
    FUZZY SETS AND SYSTEMS, 2007, 158 (22) : 2466 - 2479
  • [7] On transitivity of parametric family of cardinality-based fuzzy similarity measures
    Javed, Muhammad Aslam
    Husnine, Syed Muhammad
    Ashraf, Samina
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2018, 34 (04) : 2689 - 2706
  • [8] Transitivity-preserving fuzzification schemes for cardinality-based similarity measures
    De Baets, B
    De Meyer, H
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2005, 160 (03) : 726 - 740
  • [9] Transitivity of parametric family of cardinality-based fuzzy similarity measures using Lukasiewicz t-norm
    Javed, Muhammad Aslam
    Ashraf, Samina
    Husnine, Syed Muhammad
    BULLETIN OF COMPUTATIONAL APPLIED MATHEMATICS, 2018, 6 (01) : 9 - 40
  • [10] CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
    Liu, Shaopeng
    Koslicki, David
    BIOINFORMATICS, 2022, 38 (SUPPL 1) : 28 - 35