Maintaining very large random samples using the geometric file

Cited by: 6
Authors
Pol, Abhijit [1 ]
Jermaine, Christopher [1 ]
Arumugam, Subramanian [1 ]
Affiliation
[1] Univ Florida, Gainesville, FL 32611 USA
Source
VLDB JOURNAL | 2008 / Vol. 17 / No. 05
Funding
National Science Foundation (USA);
Keywords
Data Stream; Naive Algorithm; Reservoir Sampling; Large Random Sample; Data Management Tool;
DOI
10.1007/s00778-007-0048-z
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We also present algorithms to retrieve a small random sample from a large disk-based sample, which may be used for various purposes, including statistical analyses by a DBMS.
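For orientation, the single-pass, without-replacement requirement stated in the abstract can be illustrated with classical in-memory reservoir sampling, which the keywords list alongside "Naive Algorithm". The sketch below is not the geometric file structure from the paper; the function name `reservoir_sample` and its parameters are assumptions chosen for illustration, and it assumes the whole sample fits in memory, which is exactly the assumption the paper sets out to remove.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample (without replacement) of size k
    over every item seen so far, in one pass over the stream.

    Illustrative in-memory baseline only; it is not the disk-based
    geometric file described in the paper.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1);
            # if kept, it overwrites a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 10 records from a stream of one million integers.
    print(reservoir_sample(range(1_000_000), 10))
```

When the reservoir lives on disk, each overwrite in this baseline becomes a random disk access, which is presumably the cost that motivates the "Naive Algorithm" keyword.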
Pages: 997-1018
Number of pages: 22