Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; Framework; Inference
DOI
10.1007/s00180-021-01089-0
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many separate sites (machines or nodes) as a result of how the data are collected or how a business operates. We propose a distributed subsampling procedure for this setting that efficiently approximates the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are obtained explicitly. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
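The abstract gives no formulas, so the following is only a minimal sketch of what a two-step distributed subsampling scheme of this general kind might look like, not the authors' algorithm. It assumes a uniform pilot subsample in step one, and in step two a score of the form |y_i - p_i| * ||x_i|| (a form known from the optimal-subsampling literature for logistic regression), with each site's allocation taken proportional to its summed score. The function names, the score, and the allocation rule are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def weighted_logistic_mle(X, y, w, n_iter=100, tol=1e-8):
    """Newton-Raphson maximizer of the weighted logistic log-likelihood."""
    w = w / w.mean()  # rescaling the weights does not change the maximizer
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def two_step_subsample_estimate(sites, r0, r, rng):
    """sites: list of (X_k, y_k) array pairs, one pair per storage site.

    r0 is the total pilot size and r the total second-step size.
    """
    n_total = sum(len(y) for _, y in sites)

    # Step 1: uniform pilot subsample, allocated in proportion to site size.
    pilot_X, pilot_y = [], []
    for X, y in sites:
        r0_k = max(1, int(round(r0 * len(y) / n_total)))
        idx = rng.choice(len(y), size=r0_k, replace=True)
        pilot_X.append(X[idx])
        pilot_y.append(y[idx])
    pilot_X = np.vstack(pilot_X)
    pilot_y = np.concatenate(pilot_y)
    beta_pilot = weighted_logistic_mle(pilot_X, pilot_y, np.ones(len(pilot_y)))

    # Step 2: each site scores its own rows; only the site sums are shared.
    # ASSUMPTION: score = |y - p| * ||x||, not taken from the abstract.
    scores = [np.abs(y - sigmoid(X @ beta_pilot)) * np.linalg.norm(X, axis=1)
              for X, y in sites]
    site_sums = np.array([s.sum() for s in scores])

    sub_X, sub_y, sub_w = [], [], []
    for (X, y), s, S_k in zip(sites, scores, site_sums):
        # ASSUMPTION: allocation size proportional to the site's score sum.
        r_k = max(1, int(round(r * S_k / site_sums.sum())))
        prob = s / S_k
        idx = rng.choice(len(y), size=r_k, replace=True, p=prob)
        sub_X.append(X[idx])
        sub_y.append(y[idx])
        # Inverse-probability weight for each sampled row.
        sub_w.append(1.0 / (r_k * prob[idx]))

    return weighted_logistic_mle(
        np.vstack(sub_X), np.concatenate(sub_y), np.concatenate(sub_w)
    )


# Simulated example with four sites of unequal size (synthetic data only).
rng = np.random.default_rng(0)
beta_true = np.array([0.5, -0.5, 0.5])
sites = []
for n_k in (3000, 5000, 4000, 8000):
    X = np.column_stack([np.ones(n_k), rng.normal(size=(n_k, 2))])
    y = (rng.random(n_k) < sigmoid(X @ beta_true)).astype(float)
    sites.append((X, y))

beta_hat = two_step_subsample_estimate(sites, r0=400, r=2000, rng=rng)
print(beta_hat)
```

In this sketch each site only has to report one scalar (its summed score) before sampling, which is what makes the scheme workable when the raw rows cannot be pooled. The inverse-probability weights correct for the non-uniform sampling so that the weighted estimator targets the full-data maximum likelihood estimator.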
Pages: 2535-2562
Page count: 28