Data splitting strategies for reducing the effect of model selection on inference

Author
Faraway, JJ [1]
Affiliation
[1] Univ Michigan, Dept Stat, Ann Arbor, MI 48109 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
When an appropriate model for data is not completely known, the data are often used to select a model. Very often inference is then made from the selected model as if it had been known from the beginning. Estimates of the error of predictions or other quantities associated with that model take account of the uncertainty about the parameters of the model, but not the uncertainty about the model itself. Such error estimates tend to be too small, especially when the model uncertainty dominates the parametric uncertainty. Models are usually selected on the basis of fit, so typically the data fit the selected model rather well, thus making the error seem small. In data splitting, one part of the data is used solely for model selection and the other part for inference, thus hopefully avoiding the over-optimism induced by using the same data both to select a model and to estimate its parameters. Data splitting is easy to implement and is thus an attractive alternative to complex methods of adjusting for the effect of model selection on inference. Three tasks need to be performed: model selection, prediction and error assessment. We investigate different strategies for allotting the two parts of the data between these three tasks. We devise a new graphical method, called an honesty plot, for jointly assessing prediction accuracy and error estimates. The plot can be used to show the actual coverage of confidence intervals of any given nominal level. Variable selection, Box-Cox transformation and more complex simulation experiments are used to compare the various strategies. The performance of data splitting is found to be no better than using all the data for both selection and inference.
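The basic idea described in the abstract (one part of the data for model selection, the other for inference) can be illustrated with a minimal sketch. This is not the paper's procedure or its simulation design: the forward selection by BIC, the 50/50 split, the normal-quantile prediction intervals, the simulated data, and all function names (`select_variables`, `split_select_infer`, and so on) are assumptions chosen only for illustration.

```python
import numpy as np


def bic(X, y):
    """BIC of a Gaussian linear model fitted by least squares."""
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + X.shape[1] * np.log(n)


def select_variables(X, y):
    """Forward selection by BIC; column 0 (the intercept) is always kept."""
    chosen = [0]
    remaining = list(range(1, X.shape[1]))
    best = bic(X[:, chosen], y)
    while remaining:
        score, j = min((bic(X[:, chosen + [j]], y), j) for j in remaining)
        if score >= best:  # no candidate improves the criterion
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen


def fit_ols(X, y):
    """Return coefficients, residual variance, and (X'X)^-1."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, sigma2, xtx_inv


def split_select_infer(X, y, rng, z=1.96):
    """Select a model on one random half, then fit and build intervals on the other."""
    n = X.shape[0]
    perm = rng.permutation(n)
    sel_idx, inf_idx = perm[: n // 2], perm[n // 2:]
    cols = select_variables(X[sel_idx], y[sel_idx])
    beta, sigma2, xtx_inv = fit_ols(X[inf_idx][:, cols], y[inf_idx])

    def predict(X_new):
        Xn = X_new[:, cols]
        mean = Xn @ beta
        # Prediction standard error; z = 1.96 is a normal approximation
        # to the t quantile for a nominal 95% interval.
        lever = np.einsum("ij,jk,ik->i", Xn, xtx_inv, Xn)
        se = np.sqrt(sigma2 * (1.0 + lever))
        return mean, mean - z * se, mean + z * se

    return cols, predict


# Simulated data (an assumption): only predictors 1 and 2 matter.
rng = np.random.default_rng(0)
n, p = 200, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(size=n)

cols, predict = split_select_infer(X, y, rng)

# Empirical coverage of the nominal 95% intervals on fresh data.
X_new = np.column_stack([np.ones(2000), rng.normal(size=(2000, p))])
y_new = 1.0 + 2.0 * X_new[:, 1] - 1.5 * X_new[:, 2] + rng.normal(size=2000)
mean, lo, hi = predict(X_new)
coverage = np.mean((y_new >= lo) & (y_new <= hi))
print("selected columns:", sorted(cols))
print("empirical coverage:", coverage)
```

Comparing the empirical coverage with the nominal level, across several nominal levels and many simulated datasets, is the kind of check that the honesty plot mentioned in the abstract presents graphically. A single run like this one says little on its own, and the abstract's conclusion is that splitting did not outperform using all the data.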
Pages: 332-341
Page count: 10