Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model

Cited by: 0
Authors
Zhang, Sheng [1]
Tan, Fei [1]
Peng, Hanxiang [1]
Affiliations
[1] Indiana Univ Indianapolis, Dept Math Sci, 402 N Blackford St LD 270, Indianapolis, IN 46202 USA
Keywords
Asymptotic normality; A-optimality; big data; least squares estimate; sample size determination; APPROXIMATION
DOI
10.1080/00949655.2024.2434669
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
To efficiently approximate the least squares estimator (LSE) in a Big Data linear regression model using a subsampling approach, optimal sampling distributions were derived by minimizing the trace norm of the covariance matrix of a smooth function of the subsampling LSE. An algorithm was developed that significantly reduces the computation time for the subsampling LSE compared to the full-sample LSE. Additionally, the subsampling LSE was shown to be asymptotically normal almost surely for an arbitrary sampling distribution under suitable conditions. Motivated by the need for subsampling in Big Data analysis and data splitting in machine learning, we investigated sample size determination (SSD) for multidimensional parameters and derived analytical formulas for calculating sample sizes. Through extensive simulations and real-world data applications, we assessed the numerical properties of both the subsampling approach and SSD methodology. Our findings revealed that the A-optimal subsampling method significantly outperformed uniform and leverage-score subsampling techniques. Furthermore, the algorithm considerably reduced the computational time required for implementing the full-sample LSE. Additionally, the SSD provided a theoretical basis for selecting sample sizes.
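The general idea of a subsampling LSE can be illustrated with a minimal sketch. It is not the authors' algorithm: the A-optimal sampling distribution derived in the paper is not given in this record, so the code only uses the two baselines named in the abstract, uniform and leverage-score sampling. The function names (`leverage_scores`, `subsample_lse`), the simulated data, and the inverse-probability weighting scheme are assumptions made for illustration.

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the hat matrix, computed from a thin QR factorization."""
    Q, _ = np.linalg.qr(X)           # Q has shape (n, p)
    return np.sum(Q ** 2, axis=1)    # h_i = squared norm of the i-th row of Q

def subsample_lse(X, y, probs, r, rng):
    """Weighted LSE from r rows drawn with replacement according to probs.

    Assumption: each sampled row is weighted by 1 / (r * probs[i]), the usual
    inverse-probability weighting; this is not taken from the paper.
    """
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=probs)
    sw = np.sqrt(1.0 / (r * probs[idx]))
    beta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta

# Simulated data (hypothetical sizes chosen only for the illustration).
rng = np.random.default_rng(0)
n, p, r = 20000, 5, 1000
X = rng.standard_normal((n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.standard_normal(n)

# Full-sample LSE for comparison.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Two sampling distributions over the n rows.
uniform = np.full(n, 1.0 / n)
h = leverage_scores(X)
lev = h / h.sum()

beta_unif = subsample_lse(X, y, uniform, r, rng)
beta_lev = subsample_lse(X, y, lev, r, rng)

print("uniform  error:", np.linalg.norm(beta_unif - beta_full))
print("leverage error:", np.linalg.norm(beta_lev - beta_full))
```

Any other sampling distribution over the rows, including an A-optimal one, could be passed as `probs` in the same way, provided every entry is positive and the entries sum to one.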
Pages: 628-653
Number of pages: 26