The A-optimal subsampling approach to the analysis of count data of massive size

Cited by: 1
Authors
Tan, Fei [1]
Zhao, Xiaofeng [2]
Peng, Hanxiang [1]
Affiliations
[1] Indiana Univ Indianapolis, Dept Math Sci, Indianapolis, IN USA
[2] North China Univ Water Resources & Elect Power, Sch Math & Stat, Zhengzhou, Henan, Peoples R China
Keywords
A-optimality; big data; generalised linear models; negative binomial regression; optimal subsampling; Poisson regression; hat matrix; truncation;
DOI
10.1080/10485252.2024.2383307
Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract
The uniform and the statistical leverage-scores-based (nonuniform) distributions are often used in the development of randomised algorithms and the analysis of data of massive size. Neither distribution, however, is effective at extracting the important information in the data. In this article, we construct the A-optimal subsampling estimators of parameters in generalised linear models (GLM) to approximate the full-data estimators, and derive the A-optimal distributions based on the criterion of minimising the sum of the component variances of the subsampling estimators. As calculating the distributions has the same time complexity as computing the full-data estimator, we generalise the Scoring Algorithm introduced in Zhang, Tan, and Peng ((2023), 'Sample Size Determination for Multidimensional Parameters and A-Optimal Subsampling in a Big Data Linear Regression Model', To appear in the Journal of Statistical Computation and Simulation. Preprint. Available at https://math.indianapolis.iu.edu/hanxpeng/SSD_23_4.pdf) from the Big Data linear model to GLM using iterative weighted least squares. The paper presents a comprehensive numerical evaluation of our approach on simulated and real data, comparing its performance with uniform and leverage-scores subsampling. The results show that our approach substantially outperformed uniform and leverage-scores subsampling, and that the Algorithm significantly reduced the computing time required for implementing the full-data estimator.
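As a rough illustration of the subsampling idea summarised in the abstract, the following Python sketch fits a Poisson regression on a nonuniform subsample using inverse-probability weights. It is a sketch under stated assumptions, not the paper's method: the sampling probabilities are taken proportional to |y_i - mu_i| times the norm of M^{-1} x_i, a form commonly used for A-optimality-type subsampling in the GLM literature, with M estimated from a small uniform pilot sample. The abstract does not give the paper's exact distributions, its truncation step, or the Scoring Algorithm, and all function names below are hypothetical.

```python
import numpy as np

def fit_poisson_irls(X, y, weights=None, n_iter=50, tol=1e-8):
    """Weighted Poisson regression (log link) by iteratively reweighted least squares."""
    n, p = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(np.clip(X @ beta, -30.0, 30.0))  # clip guards against overflow
        grad = X.T @ (w * (y - mu))                  # weighted score vector
        hess = X.T @ (X * (w * mu)[:, None])         # weighted Fisher information
        delta = np.linalg.solve(hess, grad)
        beta = beta + delta
        if np.max(np.abs(delta)) < tol:
            break
    return beta

def a_optimal_style_probs(X, y, beta_pilot, X_pilot):
    """Assumed form: pi_i proportional to |y_i - mu_i| * ||M^{-1} x_i||."""
    mu = np.exp(np.clip(X @ beta_pilot, -30.0, 30.0))
    mu_pilot = np.exp(np.clip(X_pilot @ beta_pilot, -30.0, 30.0))
    # Information matrix estimated from the pilot rows only (an assumption).
    M = X_pilot.T @ (X_pilot * mu_pilot[:, None]) / X_pilot.shape[0]
    XMinv = np.linalg.solve(M, X.T).T                # row i is M^{-1} x_i (M symmetric)
    scores = np.abs(y - mu) * np.linalg.norm(XMinv, axis=1)
    return scores / scores.sum()

def subsample_estimate(X, y, probs, r, rng):
    """Draw r rows with replacement and fit with inverse-probability weights."""
    idx = rng.choice(X.shape[0], size=r, replace=True, p=probs)
    w = 1.0 / (r * probs[idx])
    return fit_poisson_irls(X[idx], y[idx], weights=w)

# Simulated count data (illustrative values only).
rng = np.random.default_rng(0)
n = 20000
beta_true = np.array([0.5, 0.3, -0.2])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

pilot_idx = rng.choice(n, size=500, replace=False)   # uniform pilot subsample
beta_pilot = fit_poisson_irls(X[pilot_idx], y[pilot_idx])
probs = a_optimal_style_probs(X, y, beta_pilot, X[pilot_idx])

beta_sub = subsample_estimate(X, y, probs, r=1000, rng=rng)
beta_unif = subsample_estimate(X, y, np.full(n, 1.0 / n), r=1000, rng=rng)
beta_full = fit_poisson_irls(X, y)
print("full   :", beta_full)
print("A-style:", beta_sub)
print("uniform:", beta_unif)
```

Passing a constant probability vector to `subsample_estimate` reproduces uniform subsampling, so the same function can serve as the baseline in a comparison. Rows whose pilot residual is close to zero receive very small probabilities and therefore very large weights when drawn, which is one reason truncating the probabilities is often considered in practice.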
Pages: 29
Related papers
50 in total
  • [1] Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model
    Zhang, Sheng
    Tan, Fei
    Peng, Hanxiang
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2025, 95 (03) : 628 - 653
  • [2] Optimal Subsampling Bootstrap for Massive Data
    Ma, Yingying
    Leng, Chenlei
    Wang, Hansheng
    JOURNAL OF BUSINESS & ECONOMIC STATISTICS, 2024, 42 (01) : 174 - 186
  • [3] Optimal subsampling for modal regression in massive data
    Chao, Yue
    Huang, Lei
    Ma, Xuejun
    Sun, Jiajun
    METRIKA, 2024, 87 (04) : 379 - 409
  • [4] Optimal subsampling for multiplicative regression with massive data
    Wang, Tianzhen
    Zhang, Haixiang
    STATISTICA NEERLANDICA, 2022, 76 (04) : 418 - 449
  • [6] Distributed optimal subsampling for quantile regression with massive data
    Chao, Yue
    Ma, Xuejun
    Zhu, Boya
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2024, 233
  • [7] Adaptive iterative Hessian sketch via A-optimal subsampling
    Zhang, Aijun
    Zhang, Hengtao
    Yin, Guosheng
    STATISTICS AND COMPUTING, 2020, 30 (04) : 1075 - 1090
  • [9] Feature Screening for Massive Data Analysis by Subsampling
    Zhu, Xuening
    Pan, Rui
    Wu, Shuyuan
    Wang, Hansheng
    JOURNAL OF BUSINESS & ECONOMIC STATISTICS, 2022, 40 (04) : 1892 - 1903
  • [10] Optimal subsampling algorithms for composite quantile regression in massive data
    Jin, Jun
    Liu, Shuangzhe
    Ma, Tiefeng
    STATISTICS, 2023, 57 (04) : 811 - 843