Big Data, Small Sample

Cited by: 7
Authors
Gerlovina, Inna [1]
van der Laan, Mark J. [2]
Hubbard, Alan [3]
Affiliations
[1] Univ Calif Berkeley, Div Biostat, 101 Haviland Hall, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, 101 Haviland Hall, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Sch Publ Hlth, Div Biostat, Berkeley, CA 94720 USA
Source
Keywords
finite sample inference; hypothesis testing; multiple comparisons; FALSE;
DOI
10.1515/ijb-2017-0012
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07; 0710; 09;
Abstract
Multiple comparisons and small sample sizes, common characteristics of many types of "Big Data" including those produced by genomic studies, present specific challenges that affect the reliability of inference. Use of multiple testing procedures necessitates calculation of very small tail probabilities of a test statistic distribution. Results based on large deviation theory provide a formal condition that is necessary to guarantee error rate control given practical sample sizes, linking the number of tests and the sample size; this condition, however, is rarely satisfied. Using methods that are based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore the impact of departures of sampling distributions from typical assumptions on actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, suggesting potentially widespread problems with error rate control, specifically excessive false positives. This is an important factor that contributes to the "reproducibility crisis". We also review some other commonly used methods (such as permutation and methods based on finite sampling inequalities) in their application to multiple testing/small sample data. We point out that Edgeworth expansions, providing higher order approximations to the sampling distribution, offer a promising direction for data analysis that could improve the reliability of studies relying on large numbers of comparisons with modest sample sizes.
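The gap between nominal and actual error rates described in the abstract can be illustrated with a much simpler case than the Edgeworth-based analysis the authors use. The sketch below is not their method: it assumes exactly normal data, a one-sample t-statistic, and a Bonferroni cutoff taken from the standard normal approximation. The function names and the settings (10,000 tests, n = 10, alpha = 0.05) are hypothetical choices made only for illustration, and the t tail is computed by plain numerical integration using the standard library.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(c: float, df: int, steps: int = 20000) -> float:
    """P(T > c) for c > 0 and df >= 2.

    Substituting t = c / u maps the infinite tail onto (0, 1], where
    composite Simpson's rule is applied (steps must be even).
    """
    def integrand(u: float) -> float:
        if u == 0.0:
            return 0.0  # limit of the integrand as u -> 0 when df >= 2
        return t_pdf(c / u, df) * c / (u * u)

    h = 1.0 / steps
    total = integrand(0.0) + integrand(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3


def per_test_error(m: int, alpha: float, n: int) -> tuple[float, float]:
    """Return (nominal, actual) two-sided per-test error rates.

    nominal: Bonferroni level alpha / m.
    actual:  probability that a t-statistic with n - 1 degrees of freedom
             exceeds the cutoff taken from the normal approximation.
    """
    nominal = alpha / m
    z_cutoff = NormalDist().inv_cdf(1 - nominal / 2)
    actual = 2 * t_upper_tail(z_cutoff, n - 1)
    return nominal, actual


# Illustrative settings only (assumptions, not taken from the paper).
nominal, actual = per_test_error(m=10_000, alpha=0.05, n=10)
print(f"nominal per-test level: {nominal:.2e}")
print(f"actual per-test level:  {actual:.2e}")
print(f"inflation factor:       {actual / nominal:.0f}x")
```

Even in this idealized setting, where the only departure is using a normal cutoff for a t-distributed statistic, the actual per-test error rate is far larger than the declared one, because the Bonferroni cutoff sits deep in the tail where the two distributions differ most.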
Pages: 6
Related papers
50 in total
  • [21] Making big data small
    Fan, Wenfei
    PROCEEDINGS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2019, 475 (2225)
  • [22] Small Models for Big Data
    Mistry, Hitesh B.
    Orrell, David
    CLINICAL PHARMACOLOGY & THERAPEUTICS, 2020, 107 (04) : 710 - 711
  • [23] Small Data, Big Impact
    Webster, Dane
    Bukvic, Ivica Ico
    IEEE MULTIMEDIA, 2016, 23 (01) : 6 - 9
  • [24] Adaptive Classification of Big Data Flight Sample
    Liu Fei
    Yin Zhiping
    Huang Qiqing
    Zhang Xiayang
    Liu Jiapeng
    2015 INTERNATIONAL CONFERENCE ON COMPUTER AND COMPUTATIONAL SCIENCES (ICCCS), 2015 : 136 - 141
  • [25] AGRICULTURAL DATA ANALYTICS - SMALL TO BIG DATA
    Ravichandran, S.
    Kareemulla, K.
    INTERNATIONAL JOURNAL OF AGRICULTURAL AND STATISTICAL SCIENCES, 2018, 14 (01) : 211 - 214
  • [26] When small data beats big data
    Faraway, Julian J.
    Augustin, Nicole H.
    STATISTICS & PROBABILITY LETTERS, 2018, 136 : 142 - 145
  • [27] Querying Big Data by Accessing Small Data
    Fan, Wenfei
    Geerts, Floris
    Cao, Yang
    Deng, Ting
    Lu, Ping
    PODS'15: PROCEEDINGS OF THE 33RD ACM SYMPOSIUM ON PRINCIPLES OF DATABASE SYSTEMS, 2015, : 173 - 184
  • [28] Exploring and cleaning big data with random sample data blocks
    Salloum, Salman
    Huang, Joshua Zhexue
    He, Yulin
    JOURNAL OF BIG DATA, 2019, 6 (01)
  • [30] Big data analytics in small sample sizes: collective analysis of visual acuity and contrast sensitivity endpoints
    Lesmes, Luis
    Dorr, Michael
    Zhao, Yukai
    Lu, Zhong-Lin
    INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2023, 64 (08)