The A-optimal subsampling approach to the analysis of count data of massive size

Cited by: 1
Authors
Tan, Fei [1]
Zhao, Xiaofeng [2]
Peng, Hanxiang [1]
Affiliations
[1] Indiana Univ Indianapolis, Dept Math Sci, Indianapolis, IN USA
[2] North China Univ Water Resources & Elect Power, Sch Math & Stat, Zhengzhou, Henan, Peoples R China
Keywords
A-optimality; big data; generalised linear models; negative binomial regression; optimal subsampling; Poisson regression; hat matrix; truncation
DOI
10.1080/10485252.2024.2383307
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
The uniform and the statistical leverage-scores-based (nonuniform) distributions are often used in the development of randomised algorithms and the analysis of data of massive size. Neither distribution, however, is effective at extracting important information from data. In this article, we construct the A-optimal subsampling estimators of parameters in generalised linear models (GLM) to approximate the full-data estimators, and derive the A-optimal distributions based on the criterion of minimising the sum of the component variances of the subsampling estimators. As calculating the distributions has the same time complexity as the full-data estimator, we generalise the Scoring Algorithm, introduced in Zhang, Tan, and Peng (2023, 'Sample Size Determination for Multidimensional Parameters and A-Optimal Subsampling in a Big Data Linear Regression Model', to appear in the Journal of Statistical Computation and Simulation; preprint available at https://math.indianapolis.iu.edu/hanxpeng/SSD_23_4.pdf) for a Big Data linear model, to GLM using iterative weighted least squares. The paper presents a comprehensive numerical evaluation of our approach using simulated and real data, comparing its performance with the uniform and the leverage-scores subsamplings. The results show that our approach substantially outperformed the uniform and the leverage-scores subsamplings, and that the Algorithm significantly reduced the computing time required for implementing the full-data estimator.
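The comparison described in the abstract can be illustrated with a minimal sketch for Poisson regression. It is not the authors' implementation: the function names, the simulated data, the subsample sizes, and the pilot step are all hypothetical. The "a_opt_like" score, |y_i - mu_i| * ||M^{-1} x_i||, is an assumed form of the kind used in the optimal-subsampling literature for GLMs with a canonical link; this record does not state the paper's exact A-optimal distribution.

```python
import numpy as np

def irls_poisson(X, y, w=None, n_iter=50, tol=1e-10):
    """Weighted Poisson regression (log link) via iterative weighted least squares."""
    w = np.ones(len(y)) if w is None else w
    eta = np.log(y + 0.5)                 # data-based starting linear predictor
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(eta)
        z = eta + (y - mu) / mu           # working response for the log link
        W = w * mu                        # working weights times case weights
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        eta = X @ beta
        if converged:
            break
    return beta

def sampling_probs(X, y, beta_pilot, kind):
    """Sampling distribution over the n rows (hypothetical helper)."""
    n = X.shape[0]
    if kind == "uniform":
        return np.full(n, 1.0 / n)
    if kind == "leverage":
        Q, _ = np.linalg.qr(X)            # leverage scores = squared row norms of Q
        h = np.sum(Q**2, axis=1)
        return h / h.sum()
    if kind == "a_opt_like":              # ASSUMED form, not taken from the paper
        mu = np.exp(X @ beta_pilot)
        M = X.T @ (mu[:, None] * X) / n   # average Fisher information at the pilot
        score = np.abs(y - mu) * np.linalg.norm(np.linalg.solve(M, X.T).T, axis=1)
        return score / score.sum()
    raise ValueError(kind)

def subsample_estimate(X, y, probs, r, rng):
    """Draw r rows with replacement and fit with inverse-probability weights."""
    idx = rng.choice(X.shape[0], size=r, replace=True, p=probs)
    w = 1.0 / (X.shape[0] * probs[idx])
    return irls_poisson(X[idx], y[idx], w)

# Simulated count data (illustrative values only).
rng = np.random.default_rng(0)
n, r = 20000, 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta_full = irls_poisson(X, y)
# Pilot estimate from a small uniform subsample, used only to build the scores.
beta_pilot = subsample_estimate(X, y, sampling_probs(X, y, None, "uniform"), 500, rng)

errors = {}
for kind in ("uniform", "leverage", "a_opt_like"):
    probs = sampling_probs(X, y, beta_pilot, kind)
    est = subsample_estimate(X, y, probs, r, rng)
    errors[kind] = np.linalg.norm(est - beta_full)
    print(kind, errors[kind])
```

A single run such as this one only shows the mechanics. Comparing the distributions meaningfully would require repeating the subsampling many times and averaging the squared errors.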
Pages: 29
Related papers
50 in total
  • [21] Optimal subsampling for parametric accelerated failure time models with massive survival data
    Yang, Zehan
    Wang, HaiYing
    Yan, Jun
    STATISTICS IN MEDICINE, 2022, 41 (27) : 5421 - 5431
  • [22] Deterministic subsampling for logistic regression with massive data
    Song, Yan
    Dai, Wenlin
    COMPUTATIONAL STATISTICS, 2024, 39 (02) : 709 - 732
  • [23] Deterministic subsampling for logistic regression with massive data
    Yan Song
    Wenlin Dai
    Computational Statistics, 2024, 39 : 709 - 732
  • [24] A Sequential Addressing Subsampling Method for Massive Data Analysis Under Memory Constraint
    Pan, Rui
    Zhu, Yingqiu
    Guo, Baishan
    Zhu, Xuening
    Wang, Hansheng
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (09) : 9502 - 9513
  • [25] CluBear: a subsampling package for interactive statistical analysis with massive data on a single machine
    Xu, Ke
    Zhu, Yingqiu
    Liu, Yijing
    Wang, Hansheng
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2024,
  • [26] Optimal subsampling for semi-parametric accelerated failure time models with massive survival data using a rank-based approach
    Yang, Zehan
    Wang, Haiying
    Yan, Jun
    STATISTICS IN MEDICINE, 2024, 43 (24) : 4650 - 4666
  • [27] Random perturbation subsampling for rank regression with massive data
    He, Sijin
    Xia, Xiaochao
    STATISTICS AND COMPUTING, 2025, 35 (01)
  • [28] OPTIMAL SUBSAMPLING ALGORITHMS FOR BIG DATA REGRESSIONS
    Ai, Mingyao
    Yu, Jun
    Zhang, Huiming
    Wang, HaiYing
    STATISTICA SINICA, 2021, 31 (02) : 749 - 772
  • [29] Optimal subsampling for quantile regression in big data
    Wang, Haiying
    Ma, Yanyuan
    BIOMETRIKA, 2021, 108 (01) : 99 - 112
  • [30] Characterization of count data distributions involving additivity and binomial subsampling
    Puig, Pedro
    Valero, Jordi
    BERNOULLI, 2007, 13 (02) : 544 - 555