SUBSAMPLING AND JACKKNIFING: A PRACTICALLY CONVENIENT SOLUTION FOR LARGE DATA ANALYSIS WITH LIMITED COMPUTATIONAL RESOURCES

Times cited: 5
Authors
Wu, Shuyuan [1]
Zhu, Xuening [2, 3]
Wang, Hansheng [1]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Fudan Univ, Shanghai, Peoples R China
[3] Fudan Univ, Sch Data Sci, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
GPU; jackknife; large dataset; subsampling; IMAGE QUALITY ASSESSMENT;
DOI
10.5705/ss.202021.0257
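Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract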
Modern statistical analysis often involves large data sets, for which conventional estimation methods are not suitable, owing to limited computational resources. To solve this problem, we propose a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample as if it were the population. Then, we obtain multiple subsamples with greatly reduced sizes using simple random sampling with replacement. We do not recommend sampling methods without replacement, because this would incur a significant data processing cost when the processing occurs on a hard drive. However, such a cost does not exist if the data are processed in memory. Because subsampled data have relatively small sizes, they can be comfortably read into computer memory and processed. Based on subsampled data sets, jackknife-debiased estimators can be obtained for the target parameter. The resulting estimators are statistically consistent, with an extremely small bias. Finally, the jackknife-debiased estimators from different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal. Furthermore, its asymptotic statistical efficiency can be as good as that of the whole sample estimator under very mild conditions. The proposed method is easily implemented on most computer systems, and thus is widely applicable.
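The procedure described in the abstract (draw small subsamples with replacement, compute a jackknife-debiased estimator on each, then average) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `jackknife_debiased` and `subsample_jackknife`, the parameters `n_sub` and `k_sub`, and the choice of target parameter (a smooth function `g` of the mean, here the squared mean) are all hypothetical. The data are simulated and held in memory, whereas the setting in the abstract has the whole sample on a hard drive.

```python
import numpy as np

def jackknife_debiased(sub, g):
    """Jackknife bias-corrected estimate of g(mean) from one subsample."""
    n = sub.size
    theta_hat = g(sub.mean())
    # Leave-one-out means, computed in O(n) from the subsample total.
    loo_means = (sub.sum() - sub) / (n - 1)
    theta_loo = g(loo_means)
    # Standard jackknife bias correction.
    return n * theta_hat - (n - 1) * theta_loo.mean()

def subsample_jackknife(data, g, n_sub, k_sub, rng):
    """Average the jackknife-debiased estimates over k_sub subsamples."""
    estimates = np.empty(k_sub)
    for k in range(k_sub):
        # Simple random sampling WITH replacement from the whole sample,
        # which is treated as if it were the population.
        idx = rng.integers(0, data.size, size=n_sub)
        estimates[k] = jackknife_debiased(data[idx], g)
    return estimates.mean()

# Illustrative (assumed) setup: simulated data, target parameter = squared mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=200_000)
g = np.square

final = subsample_jackknife(data, g, n_sub=200, k_sub=500, rng=rng)
whole = g(data.mean())  # whole-sample estimator, for comparison
print(final, whole)
```

Each subsample holds only `n_sub` observations, so only a small array has to be in memory at any time. The averaging step reduces the variance that comes from using small subsamples, and the jackknife step removes the leading bias term that averaging alone would not.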
Pages: 2041-2064
Number of pages: 24