Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; FRAMEWORK; INFERENCE;
DOI
10.1007/s00180-021-01089-0
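Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract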
With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many distributed sites (machines or nodes) owing to data collection, business operations, and similar factors. We propose a distributed subsampling procedure for this setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are obtained explicitly. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
Pages: 2535-2562
Page count: 28
Related papers
50 in total
  • [21] Adaptive distributed support vector regression of massive data
    Liang, Shu-na
    Sun, Fei
    Zhang, Qi
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2024, 53 (09) : 3365 - 3382
  • [22] Distributed optimization for penalized regression in massive compositional data
    Chao, Yue
    Huang, Lei
    Ma, Xuejun
    APPLIED MATHEMATICAL MODELLING, 2025, 141
  • [23] Secure and Differentially Private Logistic Regression for Horizontally Distributed Data
    Kim, Miran
    Lee, Junghye
    Ohno-Machado, Lucila
    Jiang, Xiaoqian
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2020, 15 (15) : 695 - 710
  • [24] Optimal Feature Selection for Pedestrian Detection based on Logistic Regression Analysis
    Kim, Jonghee
    Lee, Jonghwan
    Lee, Chungsu
    Park, Eunsoo
    Kim, Junmin
    Kim, Hakil
    Lee, Jaeeun
    Jeong, Hoeri
    2013 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC 2013), 2013, : 239 - 242
  • [25] Distributed smoothed rank regression with heterogeneous errors for massive data
    Xiaohui Yuan
    Xinran Zhang
    Yue Wang
    Chunjie Wang
    Journal of the Korean Statistical Society, 2023, 52 : 1078 - 1103
  • [26] Distributed smoothed rank regression with heterogeneous errors for massive data
    Yuan, Xiaohui
    Zhang, Xinran
    Wang, Yue
    Wang, Chunjie
    JOURNAL OF THE KOREAN STATISTICAL SOCIETY, 2023, 52 (04) : 1078 - 1103
  • [27] Differentially private distributed logistic regression using private and public data
    Ji, Zhanglong
    Jiang, Xiaoqian
    Wang, Shuang
    Xiong, Li
    Ohno-Machado, Lucila
    BMC MEDICAL GENOMICS, 2014, 7
  • [28] Differentially private distributed logistic regression using private and public data
    Zhanglong Ji
    Xiaoqian Jiang
    Shuang Wang
    Li Xiong
    Lucila Ohno-Machado
    BMC Medical Genomics, 7
  • [29] Subsample ignorable likelihood for regression analysis with missing data
    Little, Roderick J.
    Zhang, Nanhua
    JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS, 2011, 60 : 591 - 605
  • [30] Optimal subsampling algorithms for composite quantile regression in massive data
    Jin, Jun
    Liu, Shuangzhe
    Ma, Tiefeng
    STATISTICS, 2023, 57 (04) : 811 - 843