Maintaining very large random samples using the geometric file

Cited by: 6
Authors
Pol, Abhijit [1 ]
Jermaine, Christopher [1 ]
Arumugam, Subramanian [1 ]
Affiliation
[1] Univ Florida, Gainesville, FL 32611 USA
Source
VLDB JOURNAL | 2008 / Vol. 17 / Issue 05
Funding
National Science Foundation (USA);
Keywords
Data Stream; Naive Algorithm; Reservoir Sampling; Large Random Sample; Data Management Tool;
DOI
10.1007/s00778-007-0048-z
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We also present algorithms to retrieve a small random sample from a large disk-based sample, which may be used for various purposes including statistical analyses by a DBMS.
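For orientation, the requirement stated in the abstract (a uniform sample without replacement of everything seen so far, maintained in a single pass) is the same guarantee that classic in-memory reservoir sampling gives, which the keywords list as a related technique. The sketch below shows only that textbook baseline, not the paper's geometric file; the function name `reservoir_sample` and its parameters are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Textbook in-memory reservoir sampling (hypothetical helper).

    After each item is processed, `reservoir` holds a uniform random
    sample, without replacement, of all items seen so far.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1);
            # if kept, it evicts a uniformly chosen resident.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

A call such as `reservoir_sample(range(1_000_000), 100)` returns 100 distinct values from the stream. When the reservoir lives on disk rather than in memory, each replacement in this baseline lands at a random position, which is the kind of cost the abstract's disk-based algorithms are aimed at.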
Pages: 997 - 1018
Number of pages: 22