Sample size determination for biomedical big data with limited labels

Cited by: 8
Authors
Richter, Aaron N. [1]
Khoshgoftaar, Taghi M. [1]
Affiliation
[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, 777 Glades Rd, Boca Raton, FL 33431 USA
Keywords
Sample size determination; Big data; Limited labels; Learning curve; Class imbalance;
DOI
10.1007/s13721-020-0218-0
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point beyond which adding more data only marginally increases model performance. This matters most when labeled data are limited, because annotation can be expensive and time-consuming. If the sample size required for accurate model performance can be determined early, resources can be allocated to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets spanning genomics, proteomics, electronic health records, and insurance claims data, each with millions of instances and a class ratio below 2%. The methods approximate the learning curve for a large amount of data using only a small amount of data. We evaluate an existing method that fits an inverse power law model to a small learning curve, and we introduce a novel semi-supervised method that uses the large amount of unlabeled data to estimate a learning curve. We find that the inverse power law method is applicable to big data, while the semi-supervised method can be better at detecting convergence. To the best of our knowledge, this is the first study to apply an inverse power law curve fitting method to big data with limited labels and to compare it with a semi-supervised approach.
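The inverse power law idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parametrization score(n) = a - b * n^(-c), the grid search over the asymptote a, the function names, and the synthetic learning-curve points are all assumptions made for illustration.

```python
import math


def fit_inverse_power_law(sizes, scores, upper=1.0, grid_steps=200):
    """Fit score(n) ~= a - b * n**(-c) to a small learning curve.

    Assumed form, not taken from the paper. For a fixed asymptote a,
    log(a - score) = log(b) - c * log(n) is linear in log(n), so b and c
    follow from ordinary least squares; a is chosen by a simple grid scan.
    """
    top = max(scores)
    if top >= upper:
        raise ValueError("scores must lie strictly below `upper`")
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for i in range(1, grid_steps + 1):
        # Candidate asymptotes lie strictly above the best observed score.
        a = top + (upper - top) * i / grid_steps
        ys = [math.log(a - s) for s in scores]
        mean_y = sum(ys) / len(ys)
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        b = math.exp(mean_y - slope * mean_x)
        c = -slope
        # Judge each candidate by squared error on the original scale.
        sse = sum((a - b * n ** (-c) - s) ** 2 for n, s in zip(sizes, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


def predict(n, a, b, c):
    """Extrapolated score at sample size n under the fitted curve."""
    return a - b * n ** (-c)


def required_size(a, b, c, tolerance):
    """Smallest n whose predicted score is within `tolerance` of a."""
    return math.ceil((b / tolerance) ** (1.0 / c))


# Synthetic learning curve (illustrative only, not data from the study):
# points generated from a = 0.9, b = 0.5, c = 0.5 at small sample sizes.
sizes = [100, 200, 400, 800, 1600]
scores = [0.9 - 0.5 * n ** -0.5 for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
print("fitted asymptote:", round(a, 4))
print("predicted score at n=10000:", round(predict(10000, a, b, c), 4))
print("n needed to be within 0.01 of the asymptote:", required_size(a, b, c, 0.01))
```

The grid scan keeps the sketch dependency-free; a nonlinear least-squares routine would be a natural replacement when a numerical library is available. On real learning curves the points are noisy, so the extrapolation should be treated as a rough guide to where performance starts to level off.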
Pages: 13