Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China; U.S. National Science Foundation;
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; FRAMEWORK; INFERENCE;
DOI
10.1007/s00180-021-01089-0
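Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject classification codes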
020208 ; 070103 ; 0714 ;
Abstract
With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many distributed sites (machines or nodes) as a result of data collection or business operations. We propose a distributed subsampling procedure in such a setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are explicitly obtained. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
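The abstract does not state the formulas, so the following is only a minimal sketch of what a two-step distributed subsampling scheme for logistic regression could look like, not the authors' algorithm. It assumes subsampling scores proportional to |y_i - p_i| * ||x_i|| (a form commonly used in the optimal subsampling literature) and allocation sizes proportional to each site's total score; both choices, and the names `weighted_logistic_mle` and `two_step_distributed_subsample`, are hypothetical illustrations.

```python
import numpy as np


def weighted_logistic_mle(X, y, w, n_iter=50, tol=1e-8):
    """Newton-Raphson maximizer of the weighted logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta


def two_step_distributed_subsample(sites, r0, r, rng):
    """sites: list of (X_k, y_k) pairs, one per site.
    r0: pilot subsample size per site; r: total second-step subsample size."""
    # Step 1: uniform pilot subsample at every site, pooled into a pilot estimate.
    Xp, yp, wp = [], [], []
    for X, y in sites:
        idx = rng.choice(len(y), size=r0, replace=True)
        Xp.append(X[idx])
        yp.append(y[idx])
        wp.append(np.full(r0, len(y) / r0))  # inverse uniform sampling rate
    beta0 = weighted_logistic_mle(
        np.vstack(Xp), np.concatenate(yp), np.concatenate(wp)
    )

    # Step 2: score every point with the pilot estimate (assumed score form).
    scores = []
    for X, y in sites:
        p = 1.0 / (1.0 + np.exp(-(X @ beta0)))
        scores.append(np.abs(y - p) * np.linalg.norm(X, axis=1))
    total = sum(s.sum() for s in scores)

    Xs, ys, ws = [], [], []
    for (X, y), s in zip(sites, scores):
        # Assumed allocation rule: site size proportional to its total score.
        r_k = max(1, int(np.rint(r * s.sum() / total)))
        pi = s / s.sum()  # within-site subsampling probabilities
        idx = rng.choice(len(y), size=r_k, replace=True, p=pi)
        Xs.append(X[idx])
        ys.append(y[idx])
        ws.append(1.0 / (r_k * pi[idx]))  # inverse-probability weights
    return weighted_logistic_mle(
        np.vstack(Xs), np.concatenate(ys), np.concatenate(ws)
    )
```

Only the per-site score totals and the drawn subsamples need to leave each site in this sketch, which is the practical appeal of subsampling when the full data cannot be pooled. The inverse-probability weights make the weighted estimator target the full-data maximum likelihood estimator.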
Pages: 2535-2562
Number of pages: 28
Related papers
50 in total
  • [1] Optimal subsample selection for massive logistic regression with distributed data
    Lulu Zuo
    Haixiang Zhang
    HaiYing Wang
    Liuquan Sun
    Computational Statistics, 2021, 36 : 2535 - 2562
  • [2] Distributed optimal subsampling for quantile regression with massive data
    Chao, Yue
    Ma, Xuejun
    Zhu, Boya
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2024, 233
  • [3] Distributed information-based optimal sub-data selection algorithm for big data logistic regression
    Wan, Xiangxin
    Liu, Yanyan
    Ye, Xin
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2025,
  • [4] Unified distributed robust regression and variable selection framework for massive data
    Wang, Kangning
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 186
  • [5] Deterministic subsampling for logistic regression with massive data
    Song, Yan
    Dai, Wenlin
    COMPUTATIONAL STATISTICS, 2024, 39 (02) : 709 - 732
  • [6] Deterministic subsampling for logistic regression with massive data
    Yan Song
    Wenlin Dai
    Computational Statistics, 2024, 39 : 709 - 732
  • [7] Information-based optimal subdata selection for big data logistic regression
    Cheng, Qianshun
    Wang, HaiYing
    Yang, Min
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2020, 209 : 112 - 122
  • [8] Distributed quantile regression for massive heterogeneous data
    Hu, Aijun
    Jiao, Yuling
    Liu, Yanyan
    Shi, Yueyong
    Wu, Yuanshan
    NEUROCOMPUTING, 2021, 448 : 249 - 262
  • [9] Distributed Penalized Modal Regression for Massive Data
    Jin Jun
    Liu Shuangzhe
    Ma Tiefeng
    JOURNAL OF SYSTEMS SCIENCE & COMPLEXITY, 2023, 36 (02) : 798 - 821
  • [10] Distributed Penalized Modal Regression for Massive Data
    Jun Jin
    Shuangzhe Liu
    Tiefeng Ma
    Journal of Systems Science and Complexity, 2023, 36 : 798 - 821