Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model

Cited by: 0
Authors
Zhang, Sheng [1]
Tan, Fei [1]
Peng, Hanxiang [1]
Affiliations
[1] Indiana Univ Indianapolis, Dept Math Sci, 402 N Blackford St LD 270, Indianapolis, IN 46202 USA
Keywords
Asymptotic normality; A-optimality; big data; least squares estimate; sample size determination; APPROXIMATION;
DOI
10.1080/00949655.2024.2434669
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
To efficiently approximate the least squares estimator (LSE) in a Big Data linear regression model using a subsampling approach, optimal sampling distributions were derived by minimizing the trace norm of the covariance matrix of a smooth function of the subsampling LSE. An algorithm was developed that significantly reduces the computation time for the subsampling LSE compared to the full-sample LSE. Additionally, the subsampling LSE was shown to be asymptotically normal almost surely for an arbitrary sampling distribution under suitable conditions. Motivated by the need for subsampling in Big Data analysis and data splitting in machine learning, we investigated sample size determination (SSD) for multidimensional parameters and derived analytical formulas for calculating sample sizes. Through extensive simulations and real-world data applications, we assessed the numerical properties of both the subsampling approach and SSD methodology. Our findings revealed that the A-optimal subsampling method significantly outperformed uniform and leverage-score subsampling techniques. Furthermore, the algorithm considerably reduced the computational time required for implementing the full-sample LSE. Additionally, the SSD provided a theoretical basis for selecting sample sizes.
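As a rough illustration of the subsampling idea in the abstract, the sketch below draws rows with replacement from a sampling distribution and fits an inverse-probability-weighted least squares estimate. It is not the authors' algorithm: the A-optimal sampling distribution is not given in this record, so the example uses only the two baselines the abstract names (uniform and leverage-score sampling), and it is restricted to a one-predictor model with an intercept to stay dependency-free. The function names `weighted_lse`, `leverage_probs`, and `subsample_lse`, and the simulated data, are all hypothetical.

```python
import random


def weighted_lse(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return yw - b1 * xw, b1


def leverage_probs(x):
    """Leverage-score sampling probabilities for the intercept-plus-slope model."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Leverage of row i; with two parameters the leverages sum to 2,
    # so dividing by 2 turns them into a probability distribution.
    return [(1.0 / n + (xi - xbar) ** 2 / sxx) / 2.0 for xi in x]


def subsample_lse(x, y, probs, r, rng):
    """Draw r rows with replacement according to probs, then fit with 1/prob weights."""
    idx = rng.choices(range(len(x)), weights=probs, k=r)
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    ws = [1.0 / probs[i] for i in idx]  # inverse-probability weights
    return weighted_lse(xs, ys, ws)


if __name__ == "__main__":
    rng = random.Random(0)
    n, r = 20000, 1000  # assumed full-sample and subsample sizes
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # simulated data

    print("full sample:", weighted_lse(x, y, [1.0] * n))
    print("uniform    :", subsample_lse(x, y, [1.0 / n] * n, r, rng))
    print("leverage   :", subsample_lse(x, y, leverage_probs(x), r, rng))
```

Any other sampling distribution with strictly positive probabilities can be passed as `probs`, which is where an A-optimal distribution would plug in once its formula is taken from the paper itself.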
Pages: 628-653
Number of pages: 26
Related papers
50 in total
  • [41] Development of the Algorithm for Finding the Optimal Path in a Transport Network with Dynamic Parameters based on the Multidimensional Data Model
    Sokolov, Alexsey
    Bakulev, Alexander
    Fetisova, Tatyana
    Bakuleva, Marina
    2019 8TH MEDITERRANEAN CONFERENCE ON EMBEDDED COMPUTING (MECO), 2019, : 262 - 265
  • [42] DF classification algorithm for constructing a small sample size of data-oriented DF regression model
    Xia, Heng
    Tang, Jian
    Qiao, Junfei
    Zhang, Jian
    Yu, Wen
    NEURAL COMPUTING & APPLICATIONS, 2022, 34 (04): : 2785 - 2810
  • [43] DF classification algorithm for constructing a small sample size of data-oriented DF regression model
    Heng Xia
    Jian Tang
    Junfei Qiao
    Jian Zhang
    Wen Yu
    Neural Computing and Applications, 2022, 34 : 2785 - 2810
  • [44] SAMPLE-SIZE DETERMINATION FOR COHORT STUDIES UNDER AN EXPONENTIAL COVARIATE MODEL WITH GROUPED DATA
    LUI, KJ
    BIOMETRICS, 1993, 49 (03) : 773 - 778
  • [45] The optimum of heat recovery - Determination of the optimal heat recovery based on a multiple non-linear regression model
    Kaup, Christoph
    JOURNAL OF BUILDING ENGINEERING, 2021, 38
  • [46] A linear model of acoustic-to-facial mapping: Model parameters, data set size, and generalization across speakers
    Craig, Matthew S.
    Van Lieshout, Pascal
    Wong, Willy
    Journal of the Acoustical Society of America, 2008, 124 (05): : 3183 - 3190
  • [47] A linear model of acoustic-to-facial mapping: Model parameters, data set size, and generalization across speakers
    Craig, Matthew S.
    van Lieshout, Pascal
    Wong, Willy
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2008, 124 (05): : 3183 - 3190
  • [48] Optimal Trade-Off Between Sample Size and Precision of Supervision for the Fixed Effects Panel Data Model
    Gnecco, Giorgio
    Nutarelli, Federico
    MACHINE LEARNING, OPTIMIZATION, AND DATA SCIENCE, 2019, 11943 : 531 - 542
  • [49] Sample size calculations for the differential expression analysis of RNA-seq data using a negative binomial regression model
    Li, Xiaohong
    Wu, Dongfeng
    Cooper, Nigel G. F.
    Rai, Shesh N.
    STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY, 2019, 18 (01)
  • [50] A Sample-Wise Data Driven Control Solver for the Stochastic Optimal Control Problem with Unknown Model Parameters
    Archibald, Richard
    Bao, Feng
    Yong, Jiongmin
    COMMUNICATIONS IN COMPUTATIONAL PHYSICS, 2023, 33 (04) : 1132 - 1163