Generalized substring selectivity estimation

Cited by: 7
Authors
Chen, ZY
Korn, F
Koudas, N
Muthukrishnan, S
Affiliations
[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
[2] AT&T Labs Res, Florham Pk, NJ 07932 USA
DOI
10.1016/S0022-0000(02)00031-4
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In a variety of settings from relational databases to LDAP to Web applications, there is an increasing need to quickly and accurately estimate the count of tuples (LDAP entries, Web documents, etc.) matching Boolean substring queries. In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial. Selectivity estimation for generalized Boolean queries has not been studied previously; our own prior work, which is discussed and extended herein, applies to the case of one-dimensional Boolean queries [CKKM00]. Existing methods for the case of multidimensional conjunctive queries approximate selectivities by explicitly storing cross-counts of frequently co-occurring combinations of substrings; estimates are obtained by parsing the query into multidimensional substrings corresponding to stored cross-counts and applying probabilistic formulae. The major problem with these methods is that the number of cross-counts stored by known methods increases exponentially with the number of dimensions (a "space dimensionality explosion") due to the need to capture the correlation amongst the dimensions. Hence, given a limited amount of space, none of the existing methods can reliably give accurate estimates. Moreover, these methods do not generalize to Boolean queries gracefully. We present a novel approach to selectivity estimation for generalized Boolean substring queries with a focus on the two cases of (1) conjunctive multidimensional and (2) Boolean queries. Our approach does not explicitly store cross-counts, but rather generates them on-the-fly. We employ a Monte Carlo technique called set hashing to succinctly represent the set of tuples containing a given substring as a signature vector of hash values; any combination of set hash signatures gives a cross-count when intersected. Thus, using only linear storage, a large number of cross-counts can be generated including those for complex co-occurrences of substrings.
The cross-counts generated by our methods are not exact, but they are adequate for selectivity estimation. We present results from an extensive experimental evaluation of our approach on real data sets. For the case of multidimensional conjunctive queries, our approach achieves better accuracy by an order of magnitude, and scales much more gracefully to higher dimensions, than existing methods. Surprisingly, even though our approach involves generating cross-counts on-the-fly, estimation is very fast, taking 200 μs on a data set of size 6 MB. For the case of Boolean queries, our experiments also demonstrate the superiority of this approach over a straightforward independence-based approach wherein correlations are not captured. © 2003 Published by Elsevier Science (USA).
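The set hashing idea described in the abstract can be illustrated with a generic min-wise hashing sketch. This is not the authors' implementation: the function names (`signature`, `cross_count`), the use of keyed BLAKE2b as the hash family, and the particular estimator are assumptions made for illustration only. Each substring's set of matching tuple IDs is summarized by a fixed-length signature, and an approximate cross-count is derived from the signatures plus the individual counts, without storing the sets themselves.

```python
import hashlib

def _hash(i, x):
    """Hypothetical i-th hash function: BLAKE2b keyed by the index i."""
    key = i.to_bytes(4, "little")
    digest = hashlib.blake2b(str(x).encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big")

def signature(tuple_ids, k=256):
    """Signature vector: for each of k hash functions, the minimum hash
    value over the set of tuple IDs (assumed non-empty)."""
    return [min(_hash(i, x) for x in tuple_ids) for i in range(k)]

def cross_count(sigs, counts):
    """Estimate the size of the intersection of several sets from their
    signatures and their individual sizes.

    Position i of a signature equals the union's minimum exactly when the
    union's minimum element lies in that set; all signatures agree at i
    exactly when it lies in the intersection. The ratio of the two
    frequencies therefore estimates |intersection| / |A| for a chosen set A.
    """
    ref = max(range(len(sigs)), key=lambda j: counts[j])  # largest set as reference
    union_sig = [min(col) for col in zip(*sigs)]
    in_ref = 0
    in_all = 0
    for i, u in enumerate(union_sig):
        if sigs[ref][i] == u:
            in_ref += 1
            if all(s[i] == u for s in sigs):
                in_all += 1
    if in_ref == 0:
        return 0.0
    return counts[ref] * in_all / in_ref

# Usage: tuples containing substring "a" vs. tuples containing substring "b".
set_a = set(range(0, 1000))
set_b = set(range(500, 1500))
estimate = cross_count([signature(set_a), signature(set_b)],
                       [len(set_a), len(set_b)])
```

The estimate is approximate, and its accuracy depends on the signature length `k`; storage per substring is `k` hash values regardless of how many tuples contain it, which is the sense in which many cross-counts can be generated from linear storage.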
Pages: 98-132
Page count: 35
Related papers
50 in total
  • [1] Multi-dimensional substring selectivity estimation
    Jagadish, HV
    Kapitskaia, O
    Ng, RT
    Srivastava, D
    PROCEEDINGS OF THE TWENTY-FIFTH INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, 1999, : 387 - 398
  • [2] One-dimensional and multi-dimensional substring selectivity estimation
    Jagadish, HV
    Kapitskaia, O
    Ng, RT
    Srivastava, D
    VLDB JOURNAL, 2000, 9 (03): : 214 - 230
  • [3] One-dimensional and multi-dimensional substring selectivity estimation
    H.V. Jagadish
    Olga Kapitskaia
    Raymond T. Ng
    Divesh Srivastava
    The VLDB Journal, 2000, 9 : 214 - 230
  • [4] Generalized Substring Compression
    Keller, Orgad
    Kopelowitz, Tsvi
    Landau, Shir
    Lewenstein, Moshe
    COMBINATORIAL PATTERN MATCHING, PROCEEDINGS, 2009, 5577 : 26 - 38
  • [5] Generalized substring compression
    Keller, Orgad
    Kopelowitz, Tsvi
    Feibish, Shir Landau
    Lewenstein, Moshe
    THEORETICAL COMPUTER SCIENCE, 2014, 525 : 42 - 54
  • [6] Generalized closest substring encryption
    Fuchun Guo
    Willy Susilo
    Yi Mu
    Designs, Codes and Cryptography, 2016, 80 : 103 - 124
  • [7] Generalized closest substring encryption
    Guo, Fuchun
    Susilo, Willy
    Mu, Yi
    DESIGNS CODES AND CRYPTOGRAPHY, 2016, 80 (01) : 103 - 124
  • [8] Substring Density Estimation From Traces
    Mazooji, Kayvon
    Shomorony, Ilan
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2024, 70 (08) : 5782 - 5798
  • [9] Substring count estimation in extremely long strings
    Bae, J
    Lee, S
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2006, E89D (03): : 1148 - 1156
  • [10] Space-Efficient Substring Occurrence Estimation
    Alessio Orlandi
    Rossano Venturini
    Algorithmica, 2016, 74 : 65 - 90