Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment

Cited by: 27
Authors
Fernandez, Raul Castro [1]
Min, Jisoo [1]
Nava, Demitri [1]
Madden, Samuel [1]
Affiliations
[1] MIT, CSAIL, Cambridge, MA 02139 USA
DOI
10.1109/ICDE.2019.00109
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Computing such relationships exactly is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n^2) operation. Fortunately, it is possible to obtain approximate solutions much faster using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to computing containment, and it only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment scores. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original Jaccard similarity (JS) and Jaccard containment (JC) estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query, instead of only returning the candidate set; ii) it permits querying for a specific threshold on the fly, as opposed to LSH indexes, which need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.
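The abstract's link between Jaccard similarity, containment, and set cardinalities can be illustrated with a small sketch. It is not the LAZO estimator itself, only the standard set identities it builds on: JS(A, B) = |A ∩ B| / |A ∪ B|, JC(A, B) = |A ∩ B| / |A|, and, since |A ∪ B| = |A| + |B| - |A ∩ B|, the intersection size can be recovered from JS and the two cardinalities. All function names below are hypothetical and chosen for illustration.

```python
from typing import Set, Tuple


def jaccard_similarity(a: Set, b: Set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def jaccard_containment(a: Set, b: Set) -> float:
    """Exact containment of A in B: |A ∩ B| / |A| (0.0 if A is empty)."""
    return len(a & b) / len(a) if a else 0.0


def containment_from_similarity(js: float, card_a: int, card_b: int) -> Tuple[float, float]:
    """Derive (JC(A, B), JC(B, A)) from a similarity value and the two cardinalities.

    From JS = I / (|A| + |B| - I), solving for the intersection size I gives
    I = JS * (|A| + |B|) / (1 + JS). This is a plain algebraic identity,
    shown here as an assumption-free sketch, not the paper's exact procedure.
    """
    intersection = js * (card_a + card_b) / (1.0 + js)
    jc_a_in_b = intersection / card_a if card_a else 0.0
    jc_b_in_a = intersection / card_b if card_b else 0.0
    return jc_a_in_b, jc_b_in_a


if __name__ == "__main__":
    a = {1, 2, 3, 4}
    b = {3, 4, 5, 6, 7, 8}
    js = jaccard_similarity(a, b)
    print(js, containment_from_similarity(js, len(a), len(b)))
```

In practice the similarity value would come from an approximate sketch such as MinHash and the cardinalities from exact counts or an estimator, so the derived containment values inherit the error of those estimates.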
Pages: 1190 - 1201
Page count: 12