Statistical disclosure control via sufficiency under the multiple linear regression model

Cited by: 6
Authors
Klein, Martin Daniel [1]
Datta, Gauri Sankar [1, 2]
Affiliations
[1] US Census Bur, Ctr Stat Res & Methodol, 4600 Silver Hill Rd, Washington, DC 20233 USA
[2] Univ Georgia, Dept Stat, Athens, GA 30602 USA
Keywords
Conditional distribution; statistical disclosure control; sufficient statistics; synthetic data
DOI
10.1080/15598608.2017.1350606
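Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract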
In this article we show, under the normal multiple linear regression model, how synthetic data can be generated using the principle of sufficiency. An advantage of this approach is that if the regression model assumed by the synthetic data producer is correctly specified, then the synthetic data have the same joint distribution as the original data, and therefore one can use standard regression methodology and software to analyze the synthetic data. If the same regression model used to generate the synthetic data is also used for data analysis, and the data are analyzed using standard regression methodology, then the synthetic data yield identical inference to that of the original data. We also study the effects of overfitting or under-fitting the linear regression model. We show that even if the data producer overspecifies the regression model when creating the synthetic data, the synthetic data will still have the same distribution as the original data, and hence valid inference can be obtained. However, if the data producer underspecifies the linear regression model, then one cannot expect to obtain valid inference from the synthetic data. The disclosure risk of the proposed method relative to a standard synthetic data method is also examined.
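The sufficiency idea in the abstract can be illustrated with a minimal sketch. It is not taken from the article and may differ from the authors' algorithm. It relies on a standard fact about the normal linear model with known design matrix X of full column rank: the least-squares coefficients and the residual sum of squares are jointly sufficient, and conditional on them the residual vector is uniformly distributed on a sphere of radius sqrt(RSS) in the orthogonal complement of the column space of X. The function name `synthesize_by_sufficiency` and the simulated data are hypothetical.

```python
import numpy as np

def synthesize_by_sufficiency(X, y, rng):
    """Draw a synthetic response with the same OLS fit and RSS as y.

    Assumes X has full column rank (an assumption of this sketch).
    """
    # Sufficient statistics of the original data: OLS coefficients and RSS.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    rss = np.sum((y - fitted) ** 2)

    # Orthonormal basis for the column space of X (reduced QR).
    Q, _ = np.linalg.qr(X)

    # Random direction in the orthogonal complement of col(X):
    # project a standard normal vector, then rescale to length sqrt(RSS).
    z = rng.standard_normal(len(y))
    r = z - Q @ (Q.T @ z)
    r *= np.sqrt(rss) / np.linalg.norm(r)

    # Synthetic response: original fitted values plus a fresh residual vector.
    return fitted + r

# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)
y_syn = synthesize_by_sufficiency(X, y, rng)
```

Refitting the same regression to `y_syn` reproduces the original coefficient estimates and residual sum of squares up to floating-point error, which is the sense in which standard regression analysis of the synthetic data matches that of the original data when the analyst uses the same model.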
Pages: 100-110
Number of pages: 11