On the selection of optimal subdata for big data regression based on leverage scores

Cited by: 0
Authors
Chasiotis, Vasilis [1]
Karlis, Dimitris [1]
Affiliations
[1] Athens Univ Econ & Business, Dept Stat, Athens, Greece
Keywords
D-optimal designs; Design of experiments; Subdata; Linear regression; Information matrix
DOI
10.1007/s42519-024-00420-4
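Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract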
The demand for computational resources in the modeling process grows with the scale of the datasets, since traditional approaches to regression involve inverting huge data matrices. The main problem lies in the large data size, so a standard approach is subsampling, which aims at obtaining the most informative portion of the big data. In the current paper, we explore an existing approach based on leverage scores, proposed for subdata selection in linear model discrimination. Our objective is to propose this approach for selecting the most informative data points to estimate the unknown parameters in both the first-order linear model and a model with interactions. Based on simulation experiments and a real data application, we conclude that the approach based on leverage scores improves on existing approaches.
Pages: 19
Related papers
50 in total
  • [1] Information-Based Optimal Subdata Selection for Big Data Linear Regression
    Wang, HaiYing
    Yang, Min
    Stufken, John
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2019, 114 (525) : 393 - 405
  • [2] Information-based optimal subdata selection for big data logistic regression
    Cheng, Qianshun
    Wang, HaiYing
    Yang, Min
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2020, 209 : 112 - 122
  • [3] Subdata selection based on orthogonal array for big data
    Ren, Min
    Zhao, Sheng-Li
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2023, 52 (15) : 5483 - 5501
  • [4] Model-Robust Subdata Selection for Big Data
    Shi, Chenlu
    Tang, Boxin
    JOURNAL OF STATISTICAL THEORY AND PRACTICE, 2021, 15 (04)
  • [5] Model-Robust Subdata Selection for Big Data
    Chenlu Shi
    Boxin Tang
    Journal of Statistical Theory and Practice, 2021, 15
  • [6] Distributed subdata selection for big data via sampling-based approach
    Zhang, Haixiang
    Wang, HaiYing
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2021, 153
  • [7] pylspack: Parallel Algorithms and Data Structures for Sketching, Column Subset Selection, Regression, and Leverage Scores
    Sobczyk, Aleksandros
    Gallopoulos, Efstratios
    ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2022, 48 (04)
  • [8] Distributed information-based optimal sub-data selection algorithm for big data logistic regression
    Wan, Xiangxin
    Liu, Yanyan
    Ye, Xin
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2025
  • [9] Information-based optimal subdata selection for non-linear models
    Yu, Jun
    Liu, Jiaqi
    Wang, HaiYing
    STATISTICAL PAPERS, 2023, 64 (04) : 1069 - 1093
  • [10] Information-based optimal subdata selection for non-linear models
    Jun Yu
    Jiaqi Liu
    HaiYing Wang
    Statistical Papers, 2023, 64 : 1069 - 1093