Big Data, Small Sample

Cited by: 7
Authors
Gerlovina, Inna [1]
van der Laan, Mark J. [2 ]
Hubbard, Alan [3 ]
Affiliations
[1] Univ Calif Berkeley, Div Biostat, 101 Haviland Hall, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, 101 Haviland Hall, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Sch Publ Hlth, Div Biostat, Berkeley, CA 94720 USA
Keywords
finite sample inference; hypothesis testing; multiple comparisons
DOI
10.1515/ijb-2017-0012
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Multiple comparisons and small sample size, common characteristics of many types of "Big Data" including those produced by genomic studies, present specific challenges that affect the reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially widespread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to the "reproducibility crisis". We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple testing/small sample data. We point out that Edgeworth expansions, which provide higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve the reliability of studies relying on large numbers of comparisons with modest sample sizes.
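The interplay the abstract describes between the number of tests, the sample size, and the far tail of the test statistic can be illustrated with a rough sketch. It is not the authors' procedure. It assumes a one-sample t-statistic, a one-sided Bonferroni cutoff taken from the normal approximation, and only the first Edgeworth correction term for a studentized mean, P(T <= x) ≈ Φ(x) + γ(2x² + 1)φ(x) / (6√n), where γ is the skewness of the data. The function names, the skewness of -1, and n = 25 are hypothetical choices for illustration, and a one-term expansion is itself only a crude guide this far into the tail.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def bonferroni_cutoff(alpha, m):
    """One-sided normal critical value for a per-test level of alpha / m."""
    # Use symmetry (lower tail) to avoid computing 1 - tiny number.
    return -STD_NORMAL.inv_cdf(alpha / m)


def edgeworth_upper_tail(x, n, skewness):
    """Approximate P(T > x) for a one-sample t-statistic.

    Assumption: only the first (n^{-1/2}) Edgeworth term for a
    studentized mean is kept; higher-order terms are ignored.
    """
    normal_tail = STD_NORMAL.cdf(-x)
    correction = skewness * (2 * x * x + 1) * STD_NORMAL.pdf(x) / (6 * sqrt(n))
    return normal_tail - correction


def inflation(alpha, m, n, skewness):
    """Ratio of the approximate actual per-test error rate to the nominal one."""
    x = bonferroni_cutoff(alpha, m)
    return edgeworth_upper_tail(x, n, skewness) / (alpha / m)


if __name__ == "__main__":
    # Hypothetical setting: skewness -1, n = 25 observations per test.
    for m in (1, 100, 10_000):
        ratio = inflation(0.05, m, n=25, skewness=-1.0)
        print(f"m={m:>6}  actual/nominal ≈ {ratio:.2f}")
```

Under these assumptions the ratio equals 1 for symmetric data and grows as the number of tests increases with the sample size held fixed, because the cutoff moves further into the tail where the normal approximation is weakest.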
Pages: 6