Statistical methodology for massive datasets and model selection

Cited by: 2
Authors
Babu, GJ [1]
McDermott, JP [1]
Affiliations
[1] Penn State Univ, Dept Stat, University Pk, PA 16802 USA
Keywords
Akaike Information Criterion; Bayesian Information Criterion; streaming data; convex hull; log-likelihood; maximum likelihood; leave-one-out jackknife-type method
DOI
10.1117/12.460339
Chinese Library Classification
P1 [Astronomy]
Discipline classification code
0704
Abstract
Astronomy is facing a revolution in the collection, storage, analysis, and interpretation of large datasets. The data volumes are several orders of magnitude larger than what astronomers and statisticians are used to dealing with, and the old methods simply do not work. The National Virtual Observatory (NVO) initiative has recently emerged in recognition of this need, with the aim of federating numerous large digital sky archives, both ground-based and space-based, and developing tools to explore and understand these vast volumes of data. In this paper, we address some of the critically important statistical challenges raised by the NVO. In particular, a low-storage, single-pass, sequential method for simultaneous estimation of multiple quantiles for massive datasets will be presented. Density estimation based on this procedure and a multivariate extension will also be discussed. The NVO also requires statistical tools to analyze moderate-size databases. Model selection is an important issue for many astrophysical databases. We present a simple likelihood-based 'leave one out' method to select the best among several possible alternatives. The performance of the method is compared with that of methods based on the Akaike Information Criterion and the Bayesian Information Criterion.
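The abstract does not spell out the leave-one-out procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes two hypothetical candidate models (normal and exponential, both fitted by closed-form maximum likelihood) and simulated data. It scores each model by the standard AIC (2k - 2 log L) and BIC (k ln n - 2 log L), and by a leave-one-out log-likelihood in which each observation is evaluated under parameters fitted to the remaining observations. All function names here are illustrative.

```python
import math
import random


def fit_normal(xs):
    """Closed-form MLE for a normal model: returns (mean, std)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, math.sqrt(var)


def loglik_normal(xs, mu, sigma):
    """Normal log-likelihood of xs under the given parameters."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )


def fit_exponential(xs):
    """Closed-form MLE for an exponential model: returns (rate,)."""
    return (len(xs) / sum(xs),)


def loglik_exponential(xs, rate):
    """Exponential log-likelihood of xs (all xs must be positive)."""
    return sum(math.log(rate) - rate * x for x in xs)


def aic(loglik, k):
    """Akaike Information Criterion; smaller is better."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion; smaller is better."""
    return k * math.log(n) - 2 * loglik


def loo_loglik(xs, fit, loglik):
    """Sum over i of log f(x_i | parameters fitted without x_i); larger is better."""
    total = 0.0
    for i in range(len(xs)):
        rest = xs[:i] + xs[i + 1:]
        total += loglik([xs[i]], *fit(rest))
    return total


# Simulated data (an assumption for illustration): exponential with rate 2.
random.seed(0)
data = [random.expovariate(2.0) for _ in range(200)]

# (fit function, log-likelihood function, number of free parameters)
models = {
    "normal": (fit_normal, loglik_normal, 2),
    "exponential": (fit_exponential, loglik_exponential, 1),
}

results = {}
for name, (fit, loglik, k) in models.items():
    ll = loglik(data, *fit(data))
    results[name] = {
        "aic": aic(ll, k),
        "bic": bic(ll, k, len(data)),
        "loo": loo_loglik(data, fit, loglik),
    }

for name, scores in results.items():
    print(name, scores)
```

AIC and BIC penalize the in-sample log-likelihood by the parameter count, whereas the leave-one-out score needs no explicit penalty because each point is scored by a fit that never saw it. The cost is one refit per observation, which is practical for the moderate-size databases the abstract has in mind but not for massive ones.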
Pages: 228-237
Page count: 10