Sample size determination for biomedical big data with limited labels

Cited by: 8
Authors
Richter, Aaron N. [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, 777 Glades Rd, Boca Raton, FL 33431 USA
Keywords
Sample size determination; Big data; Limited labels; Learning curve; Class imbalance;
DOI
10.1007/s13721-020-0218-0
Chinese Library Classification
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point where adding more data only marginally increases model performance. This is especially important for scenarios of limited labeled data, as annotation can be expensive and time consuming. If the required sample size for accurate model performance can be determined early, then resources can be allocated appropriately to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets, spanning genomics, proteomics, electronic health records, and insurance claims data, each with millions of instances and a class ratio below 2%. The methods used involve approximating a learning curve for a large amount of data using a small amount of data. We evaluate an existing method that fits an inverse power law model to a small learning curve and introduce a novel semi-supervised method that utilizes the large amount of unlabeled data for estimating a learning curve. We find that the inverse power law method is applicable to big data, while the semi-supervised method can be better at detecting convergence. To the best of our knowledge, this is the first study to apply an inverse power law curve fitting method to big data with limited labels and compare it to a semi-supervised approach.
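The inverse power law idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's parameterization or fitting procedure, so everything here is an assumption: the functional form score(n) = a - b * n^(-c) is one common way to write an inverse power law learning curve, the helper names `fit_inverse_power_law` and `predict` are hypothetical, the fit uses a simple grid search over c with closed-form least squares for a and b, and the data points are synthetic rather than taken from the study.

```python
def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score(n) ~ a - b * n**(-c) to a small learning curve.

    Assumed approach (not the paper's): grid search over the exponent c,
    with ordinary least squares for a and b at each candidate c.
    """
    if c_grid is None:
        c_grid = [i / 100 for i in range(1, 201)]  # c from 0.01 to 2.00
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in c_grid:
        # With c fixed, the model is linear in z = n**(-c).
        z = [n ** (-c) for n in sizes]
        z_mean = sum(z) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0:
            continue
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz
        a = y_mean - slope * z_mean
        b = -slope
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(n, a, b, c):
    """Extrapolate the fitted curve to a training set of size n."""
    return a - b * n ** (-c)


# Synthetic learning curve (illustrative only): scores at small sample sizes.
sizes = [100, 200, 400, 800, 1600, 3200]
scores = [0.85 - 0.9 * n ** (-0.5) for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
estimate_at_1m = predict(10**6, a, b, c)
```

In this sketch, `a` plays the role of the score the curve converges to, so comparing `predict(n, a, b, c)` against `a` for a candidate `n` gives a rough sense of how much performance is left to gain from labeling more data.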
Pages: 13
Related papers
50 in total
  • [1] Sample size determination for biomedical big data with limited labels
    Aaron N. Richter
    Taghi M. Khoshgoftaar
    Network Modeling Analysis in Health Informatics and Bioinformatics, 2020, 9
  • [2] Approximating Learning Curves for Imbalanced Big Data with Limited Labels
    Richter, Aaron N.
    Khoshgoftaar, Taghi M.
    2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019), 2019, : 237 - 242
  • [3] Sentiment analysis on big sparse data streams with limited labels
    Vasileios Iosifidis
    Eirini Ntoutsi
    Knowledge and Information Systems, 2020, 62 : 1393 - 1432
  • [4] Sentiment analysis on big sparse data streams with limited labels
    Iosifidis, Vasileios
    Ntoutsi, Eirini
    KNOWLEDGE AND INFORMATION SYSTEMS, 2020, 62 (04) : 1393 - 1432
  • [5] Sample Size determination for Censored Data
    Asghar, Naseem
    Khalil, Umair
    Khan, Dost Muhammad
    Khan, Zardad
    Din, Iftikhar Ud
    INTERNATIONAL JOURNAL OF AYURVEDIC MEDICINE, 2021, 12 (02) : 267 - 269
  • [6] Sample size determination for clustered count data
    Amatya, Anup
    Bhaumik, Dulal
    Gibbons, Robert D.
    STATISTICS IN MEDICINE, 2013, 32 (24) : 4162 - 4179
  • [7] Big Sample Size, Big Results
    Unknown
    CELL, 2010, 143 (02) : 177 - 177
  • [8] Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model
    Zhang, Sheng
    Tan, Fei
    Peng, Hanxiang
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2025, 95 (03) : 628 - 653
  • [9] Sample size and power determination when limited preliminary information is available
    Christine E. McLaren
    Wen-Pin Chen
    Thomas D. O’Sullivan
    Daniel L. Gillen
    Min-Ying Su
    Jeon H. Chen
    Bruce J. Tromberg
    BMC Medical Research Methodology, 2017, 17
  • [10] Sample size and power determination when limited preliminary information is available
    McLaren, Christine E.
    Chen, Wen-Pin
    O'Sullivan, Thomas D.
    Gillen, Daniel L.
    Su, Min-Ying
    Chen, Jeon H.
    Tromberg, Bruce J.
    BMC MEDICAL RESEARCH METHODOLOGY, 2017, 17