Statistical methodology for massive datasets and model selection

Cited by: 2
Authors
Babu, GJ [1]
McDermott, JP [1]
Affiliations
[1] Penn State Univ, Dept Stat, University Pk, PA 16802 USA
Source
Keywords
Akaike Information Criterion; Bayesian Information Criterion; streaming data; convex hull; log-likelihood; maximum likelihood; leave-one-out jackknife-type method
DOI
10.1117/12.460339
Chinese Library Classification number
P1 [Astronomy]
Discipline classification code
0704
Abstract
Astronomy is facing a revolution in data collection, storage, analysis, and interpretation of large datasets. The data volumes here are several orders of magnitude larger than what astronomers and statisticians are used to dealing with, and the old methods simply do not work. The National Virtual Observatory (NVO) initiative has recently emerged in recognition of this need and to federate numerous large digital sky archives, both ground-based and space-based, and develop tools to explore and understand these vast volumes of data. In this paper, we address some of the critically important statistical challenges raised by the NVO. In particular, a low-storage, single-pass, sequential method for simultaneous estimation of multiple quantiles for massive datasets will be presented. Density estimation based on this procedure and a multivariate extension will also be discussed. The NVO also requires statistical tools to analyze moderate-size databases. Model selection is an important issue for many astrophysical databases. We present a simple likelihood-based 'leave one out' method to select the best among several possible alternatives. The performance of the method is compared with that of methods based on the Akaike Information Criterion and the Bayesian Information Criterion.
Pages: 228-237
Number of pages: 10