Optimal subsample selection for massive logistic regression with distributed data

Cited by: 17
Authors
Zuo, Lulu [1]
Zhang, Haixiang [1]
Wang, HaiYing [2]
Sun, Liuquan [3]
Affiliations
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[2] Univ Connecticut, Dept Stat, Mansfield, CT 06269 USA
[3] Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
Allocation size; Big data; Distributed and massive data; Subsample estimator; Subsampling probabilities; FRAMEWORK; INFERENCE;
DOI
10.1007/s00180-021-01089-0
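Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract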
With the emergence of big data, it is increasingly common that the data are distributed, i.e., the data are stored at many distributed sites (machines or nodes) owing to data collection or business operations, etc. We propose a distributed subsampling procedure in such a setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are explicitly obtained. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
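The two-step idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's procedure: the abstract does not state the optimal probabilities or allocation sizes, so the sketch assumes a score of the form |y - p| * ||x|| (a choice known from earlier optimal-subsampling work on logistic regression) and allocates draws to sites in proportion to their score totals. The function names `fit_weighted_logistic` and `two_step_subsample`, the simulated data, and the sizes `r0` and `r` are all hypothetical.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, beta_init=None, iters=100, tol=1e-8):
    """Maximize a weighted logistic log-likelihood by Newton's method."""
    beta = np.zeros(X.shape[1]) if beta_init is None else beta_init.copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def two_step_subsample(sites, r0, r, rng):
    """sites: list of (X_k, y_k) pairs, one per site (assumed layout)."""
    N = sum(len(y) for _, y in sites)

    # Step 1: uniform pilot subsample, sized in proportion to each site.
    Xp, yp = [], []
    for X, y in sites:
        n_k = max(1, int(round(r0 * len(y) / N)))
        idx = rng.integers(0, len(y), size=n_k)
        Xp.append(X[idx])
        yp.append(y[idx])
    Xp, yp = np.vstack(Xp), np.concatenate(yp)
    beta0 = fit_weighted_logistic(Xp, yp, np.ones(len(yp)))

    # Step 2: score every point with the pilot estimate (assumed score form).
    scores = []
    for X, y in sites:
        p = 1.0 / (1.0 + np.exp(-(X @ beta0)))
        scores.append(np.abs(y - p) * np.linalg.norm(X, axis=1))
    totals = np.array([s.sum() for s in scores])
    # Assumed allocation rule: draws per site proportional to score totals.
    alloc = np.maximum(1, np.round(r * totals / totals.sum())).astype(int)

    Xs, ys, ws = [], [], []
    for (X, y), s, t, r_k in zip(sites, scores, totals, alloc):
        pi = s / t  # within-site sampling probabilities, sum to 1
        idx = rng.choice(len(y), size=r_k, replace=True, p=pi)
        Xs.append(X[idx])
        ys.append(y[idx])
        ws.append(1.0 / (r_k * pi[idx]))  # inverse-probability weights
    Xs, ys, ws = np.vstack(Xs), np.concatenate(ys), np.concatenate(ws)
    # Rescaling the weights does not change the maximizer.
    return fit_weighted_logistic(Xs, ys, ws / ws.mean(), beta_init=beta0)

# Simulated data spread over three sites (illustrative only).
rng = np.random.default_rng(0)
beta_true = np.array([0.5, -1.0, 1.0])
sites = []
for n_k in (4000, 6000, 10000):
    X = np.column_stack([np.ones(n_k), rng.normal(size=(n_k, 2))])
    y = (rng.random(n_k) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
    sites.append((X, y))

beta_hat = two_step_subsample(sites, r0=500, r=2000, rng=rng)
print(beta_hat)
```

The inverse-probability weights make the subsample log-likelihood an unbiased estimate of the full-data log-likelihood, which is why the weighted fit targets the full-data maximum likelihood estimator rather than a biased one.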
Pages: 2535-2562
Number of pages: 28