MISS: finding optimal sample sizes for approximate analytics

Cited by: 0

Authors:

Xuebin Su

Hongzhi Wang

Affiliation:

[1] Harbin Institute of Technology & Peng Cheng Lab

Source:

Distributed and Parallel Databases | 2022 / Vol. 40

Keywords:

OLAP; Approximate Query Processing; Sampling; Bootstrapping; Optimization;

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Nowadays, sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, finding the minimal sample size for a query under given error constraints in general, called Sample Size Optimization (SSO), is an essential yet unsolved problem. Ideally, a solution to the SSO problem achieves statistical accuracy, computational efficiency, and broad applicability all at the same time. Existing approaches either make idealistic assumptions about the statistical properties of the query or disregard them completely. This may result in overemphasizing one of the three goals while neglecting the others. To overcome these limitations, we first carefully examine the statistical properties shared by common analytical queries. Then, based on these properties, we propose a linear model, called the error model, that describes the relationship between sample sizes and the approximation errors of a query. Next, we propose a Model-guided Iterative Sample Selection (MISS) framework to solve the SSO problem in general. Afterwards, based on the MISS framework, we propose a concrete algorithm, called $L^{2}\textsc{Miss}$, to find optimal sample sizes under the $L^{2}$ norm error metric. Moreover, we extend the $L^{2}\textsc{Miss}$ algorithm to handle other error metrics. Finally, we show theoretically and empirically that the $L^{2}\textsc{Miss}$ algorithm and its extensions achieve satisfactory accuracy and efficiency for a considerably wide range of analytical queries.


Pages: 165-200

Page count: 35

50 items in total

[1] MISS: finding optimal sample sizes for approximate analytics
Su, Xuebin
Wang, Hongzhi
[J]. DISTRIBUTED AND PARALLEL DATABASES, 2022, 40 (01) : 165 - 200
[2] SEARCH AND OPTIMAL SAMPLE SIZES
MORGAN, PB
[J]. REVIEW OF ECONOMIC STUDIES, 1983, 50 (04) : 659 - 675
[3] Approximate sample sizes required to estimate length distributions
Miranda, L. E.
[J]. TRANSACTIONS OF THE AMERICAN FISHERIES SOCIETY, 2007, 136 (02) : 409 - 415
[4] ON APPROXIMATE SAMPLE SIZES FOR COMPARING 2 INDEPENDENT PROPORTIONS
URY, HK
[J]. BIOMETRICS, 1980, 36 (04) : 736 - 737
[5] On finding approximate optimal paths in weighted regions
Sun, Z
Reif, JH
[J]. JOURNAL OF ALGORITHMS, 2006, 58 (01) : 1 - 32
[6] The Optimal Sample Sizes of the Xbar and CUSUM Charts
Yang, Mei
[J]. 2009 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND ENGINEERING MANAGEMENT, VOLS 1-4, 2009, : 1322 - 1326
[7] Optimal sample sizes and statistical decision rules
Patil, Sanket
Salant, Yuval
[J]. THEORETICAL ECONOMICS, 2024, 19 (02) : 583 - 604
[8] Optimal sample sizes for alternative loss functions
Ghosh, D
[J]. AMERICAN STATISTICAL ASSOCIATION - 1996 PROCEEDINGS OF THE SECTION ON SURVEY RESEARCH METHODS, VOLS I AND II, 1996, : 232 - 233
[9] SAMPLE SIZES FOR APPROXIMATE INDEPENDENCE OF LARGEST AND SMALLEST ORDER STATISTICS
WALSH, JE
[J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1970, 65 (330) : 860 - 863
[10] Estimating Sufficient Sample Sizes for Approximate Decision Support Queries
Rudra, Amit
Gopalan, Raj P.
Achuthan, N. R.
[J]. ENTERPRISE INFORMATION SYSTEMS, ICEIS 2013, 2014, 190 : 85 - 99
