The A-optimal subsampling approach to the analysis of count data of massive size

Cited by: 1
Authors
Tan, Fei [1]
Zhao, Xiaofeng [2]
Peng, Hanxiang [1]
Affiliations
[1] Indiana Univ Indianapolis, Dept Math Sci, Indianapolis, IN USA
[2] North China Univ Water Resources & Elect Power, Sch Math & Stat, Zhengzhou, Henan, Peoples R China
Keywords
A-optimality; big data; generalised linear models; negative binomial regression; optimal subsampling; Poisson regression; hat matrix; truncation
DOI
10.1080/10485252.2024.2383307
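Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract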
The uniform and the statistical leverage-scores-based (nonuniform) distributions are often used in the development of randomised algorithms and the analysis of data of massive size. Neither distribution, however, is effective at extracting the important information in the data. In this article, we construct the A-optimal subsampling estimators of parameters in generalised linear models (GLM) to approximate the full-data estimators, and derive the A-optimal distributions based on the criterion of minimising the sum of the component variances of the subsampling estimators. As calculating the distributions has the same time complexity as the full-data estimator, we generalise the Scoring Algorithm introduced in Zhang, Tan, and Peng ((2023), 'Sample Size Determination for Multidimensional Parameters and A-Optimal Subsampling in a Big Data Linear Regression Model', to appear in the Journal of Statistical Computation and Simulation; preprint available at https://math.indianapolis.iu.edu/hanxpeng/SSD_23_4.pdf) from a Big Data linear model to GLM using iterative weighted least squares. The paper presents a comprehensive numerical evaluation of our approach using simulated and real data, comparing its performance with that of uniform and leverage-scores subsampling. The results show that our approach substantially outperformed uniform and leverage-scores subsampling, and that the Algorithm significantly reduced the computing time required for implementing the full-data estimator.
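To make the idea in the abstract concrete, here is a minimal sketch of nonuniform subsampling for Poisson regression. It is not the paper's Scoring Algorithm. It assumes a two-step pilot scheme and sampling probabilities of the form |y_i - mu_i| * ||M^{-1} x_i||, a form used in the general optimal-subsampling literature for A-optimality-type criteria. The function names (`fit_poisson_irls`, `a_optimal_probs`, `subsample_estimate`) and the sizes `r0` and `r` are hypothetical choices made for illustration.

```python
import numpy as np


def fit_poisson_irls(X, y, w=None, n_iter=50, tol=1e-8):
    """Weighted Poisson regression (log link) fitted by Fisher scoring / IRLS."""
    n, p = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(np.clip(X @ beta, -30, 30))  # clip guards against overflow
        grad = X.T @ (w * (y - mu))              # weighted score vector
        info = X.T @ (X * (w * mu)[:, None])     # weighted information matrix
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def a_optimal_probs(X, y, beta_pilot):
    """Assumed form of the probabilities: |y_i - mu_i| * ||M^{-1} x_i||, normalised."""
    mu = np.exp(np.clip(X @ beta_pilot, -30, 30))
    M = X.T @ (X * mu[:, None]) / X.shape[0]
    norms = np.linalg.norm(np.linalg.solve(M, X.T), axis=0)  # ||M^{-1} x_i|| per row
    scores = np.abs(y - mu) * norms
    return scores / scores.sum()


def subsample_estimate(X, y, r0, r, rng):
    """Pilot fit on a uniform subsample, then an inverse-probability-weighted fit."""
    n = X.shape[0]
    pilot = rng.choice(n, size=r0, replace=True)
    beta_pilot = fit_poisson_irls(X[pilot], y[pilot])
    probs = a_optimal_probs(X, y, beta_pilot)
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # Weights 1/pi_i keep the subsample estimating equation unbiased.
    return fit_poisson_irls(X[idx], y[idx], w=1.0 / probs[idx])


# Illustrative usage on simulated count data (all values are made up).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta_full = fit_poisson_irls(X, y)
beta_sub = subsample_estimate(X, y, r0=500, r=2000, rng=rng)
print("full-data estimate:", beta_full)
print("subsample estimate:", beta_sub)
```

Computing the scores still requires one pass over all n rows, which illustrates the cost issue the abstract raises: evaluating the sampling distribution is of the same order as a full-data fitting step, and this is what motivates a faster algorithm.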
Pages: 29