The A-optimal subsampling approach to the analysis of count data of massive size

Cited by: 1

Authors:

Tan, Fei [1]

Zhao, Xiaofeng [2]

Peng, Hanxiang [1]

Affiliations:

[1] Indiana Univ Indianapolis, Dept Math Sci, Indianapolis, IN USA

[2] North China Univ Water Resources & Elect Power, Sch Math & Stat, Zhengzhou, Henan, Peoples R China

Source:

JOURNAL OF NONPARAMETRIC STATISTICS | 2024

Keywords:

A-optimality; big data; generalised linear models; negative binomial regression; optimal subsampling; Poisson regression; hat matrix; truncation;

DOI:

10.1080/10485252.2024.2383307

Chinese Library Classification (CLC) number:

O21 [Probability theory and mathematical statistics]; C8 [Statistics];

Subject classification codes:

020208 ; 070103 ; 0714 ;

Abstract:

The uniform and the statistical leverage-scores-based (nonuniform) distributions are often used in the development of randomised algorithms and in the analysis of data of massive size. Neither distribution, however, is effective at extracting the important information in the data. In this article, we construct the A-optimal subsampling estimators of parameters in generalised linear models (GLM) to approximate the full-data estimators, and derive the A-optimal distributions based on the criterion of minimising the sum of the component variances of the subsampling estimators. Because calculating these distributions has the same time complexity as calculating the full-data estimator, we generalise the Scoring Algorithm introduced in Zhang, Tan, and Peng ((2023), 'Sample Size Determination for Multidimensional Parameters and A-Optimal Subsampling in a Big Data Linear Regression Model', To appear in the Journal of Statistical Computation and Simulation. Preprint. Available at https://math.indianapolis.iu.edu/hanxpeng/SSD_23_4.pdf) from a Big Data linear model to GLM using iterative weighted least squares. The paper presents a comprehensive numerical evaluation of our approach on simulated and real data, comparing its performance with that of uniform and leverage-scores subsampling. The results show that our approach substantially outperformed uniform and leverage-scores subsampling, and that the Algorithm significantly reduced the computing time relative to that required for the full-data estimator.
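The idea in the abstract can be illustrated with a minimal sketch for Poisson regression. This is not the paper's algorithm: the sampling probabilities below use the form pi_i proportional to |y_i - mu_i| * ||M^{-1} x_i||, which is the A-optimality-type distribution commonly seen in the optimal-subsampling literature, and they are computed from a uniform pilot subsample rather than by the paper's Scoring Algorithm. The function names (`poisson_irls`, `a_optimal_probs`, `two_step_estimate`), the pilot size `r0`, and the uniform mixing weight `mix` are all assumptions made for illustration; the mixing is a simple stabiliser and is not the truncation studied in the paper.

```python
import numpy as np


def poisson_irls(X, y, w=None, n_iter=50, tol=1e-10):
    """Weighted Poisson regression (log link) via iterative weighted least squares."""
    n, p = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    w = w / w.mean()  # rescaling the weights does not change the solution
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(np.clip(X @ beta, -30.0, 30.0))
        grad = X.T @ (w * (y - mu))            # weighted score
        info = X.T @ (X * (w * mu)[:, None])   # weighted Fisher information
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def a_optimal_probs(X, y, beta_pilot, mix=0.1):
    """Illustrative probabilities: |y_i - mu_i| * ||M^{-1} x_i||, mixed with uniform."""
    n = X.shape[0]
    mu = np.exp(np.clip(X @ beta_pilot, -30.0, 30.0))
    M = X.T @ (X * mu[:, None]) / n
    Z = np.linalg.solve(M, X.T).T              # row i is (M^{-1} x_i)^T
    score = np.abs(y - mu) * np.linalg.norm(Z, axis=1)
    pi = score / score.sum()
    # Assumed stabiliser (hypothetical choice): keeps every probability away from 0.
    return (1.0 - mix) * pi + mix / n


def two_step_estimate(X, y, r0, r, rng):
    """Uniform pilot of size r0, then a nonuniform subsample of size r."""
    n = X.shape[0]
    pilot = rng.choice(n, size=r0, replace=True)
    beta_pilot = poisson_irls(X[pilot], y[pilot])
    pi = a_optimal_probs(X, y, beta_pilot)
    idx = rng.choice(n, size=r, replace=True, p=pi)
    # Inverse-probability weights make the subsample estimator target the full-data one.
    return poisson_irls(X[idx], y[idx], w=1.0 / pi[idx])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.3])))
    print("full data :", poisson_irls(X, y))
    print("subsample :", two_step_estimate(X, y, r0=500, r=2000, rng=rng))
```

Computing the probabilities still requires one pass over all n rows, which is the cost the abstract refers to when it says the distributions are as expensive as the full-data estimator; the sketch makes no attempt to avoid that pass.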

Pages: 29
