A strategy for validation of variables derived from large-scale electronic health record data

Cited by: 14
Authors
Liu, Lin [1, 2]
Bustamante, Ranier [2]
Earles, Ashley [3]
Demb, Joshua [2]
Messer, Karen [2]
Gupta, Samir [1, 2]
Affiliations
[1] VA San Diego Healthcare Syst, 3500 La Jolla Village Dr, San Diego, CA 92161 USA
[2] Univ Calif San Diego, 9500 Gilman Dr, La Jolla, CA 92093 USA
[3] Vet Med Res Fdn, 3350 La Jolla Village Dr, San Diego, CA 92161 USA
Funding
National Institutes of Health (USA)
Keywords
Electronic phenotyping; Large-scale electronic health records; Data abstraction validation; Sample size; Positive predictive value; Negative predictive value; IDENTIFY PATIENTS; SAMPLE-SIZE; CODING ALGORITHM; DISEASE; ASTHMA;
DOI
10.1016/j.jbi.2021.103879
Chinese Library Classification
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Purpose: Standardized approaches for rigorous validation of phenotyping from large-scale electronic health record (EHR) data have not been widely reported. We propose a methodologically rigorous and efficient approach to guide such validation, hereafter referred to as the San Diego Approach to Variable Validation (SDAVV). It includes strategies for sampling cases and controls, determining sample sizes, estimating algorithm performance, and terminating the validation process.
Methods: We propose sample size formulae, to be applied before chart review, that are based on pre-specified critical lower bounds for positive predictive value (PPV) and negative predictive value (NPV). We also propose a stepwise strategy for iterative algorithm development/validation cycles, in which sample sizes for data abstraction are updated until both PPV and NPV achieve target performance.
Results: We applied the SDAVV to a Department of Veterans Affairs study in which we created two phenotyping algorithms, one for distinguishing normal colonoscopy cases from abnormal colonoscopy controls and one for identifying aspirin exposure. For identifying normal colonoscopy cases, estimated PPV and NPV both reached 0.970 (95% confidence lower bound of 0.915), with an estimated sensitivity of 0.963 and specificity of 0.975. The algorithm for identifying aspirin exposure reached a PPV of 0.990 (95% lower bound of 0.950) and an NPV of 0.980 (95% lower bound of 0.930), with sensitivity and specificity of 0.960 and 1.000, respectively.
Conclusions: A structured approach for prospectively developing and validating phenotyping algorithms from large-scale EHR data can be successfully implemented and should be considered to improve the quality of "big data" research.
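The abstract does not reproduce the SDAVV sample size formulae, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a one-sided Wilson score lower bound (the paper may use a different interval), and the function names, the `flagged_fraction` parameter, and all numeric inputs are hypothetical. It shows (a) a lower confidence bound for PPV or NPV from chart-review counts, (b) a search for the smallest number of charts at which that bound would clear a pre-specified critical value if the anticipated accuracy were observed, and (c) how sensitivity and specificity could be recovered from PPV, NPV, and the share of records the algorithm flags as cases.

```python
from math import floor, sqrt
from statistics import NormalDist


def wilson_lower_bound(successes: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Wilson score lower confidence bound for a proportion.

    Assumption: the Wilson interval stands in for whatever interval
    the original paper uses.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(confidence)  # one-sided critical value
    p = successes / n
    centre = p + z * z / (2 * n)
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half_width) / (1 + z * z / n)


def min_charts_to_review(expected_value: float, critical_lower_bound: float,
                         confidence: float = 0.95, max_n: int = 100_000) -> int:
    """Smallest n whose lower bound clears the critical value.

    Assumes the chart review confirms floor(expected_value * n) of the
    n sampled records, a slightly conservative rounding.
    """
    if not 0 < critical_lower_bound < expected_value <= 1:
        raise ValueError("need 0 < critical_lower_bound < expected_value <= 1")
    for n in range(1, max_n + 1):
        confirmed = floor(expected_value * n + 1e-9)
        if wilson_lower_bound(confirmed, n, confidence) >= critical_lower_bound:
            return n
    raise ValueError("no sample size up to max_n meets the target")


def sensitivity_specificity(ppv: float, npv: float, flagged_fraction: float):
    """Convert PPV/NPV to sensitivity/specificity.

    flagged_fraction (hypothetical name) is the share of all records the
    algorithm labels as cases; the conversion is needed when cases and
    controls are sampled separately by algorithm label.
    """
    true_pos = ppv * flagged_fraction
    false_pos = (1 - ppv) * flagged_fraction
    true_neg = npv * (1 - flagged_fraction)
    false_neg = (1 - npv) * (1 - flagged_fraction)
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical inputs: anticipated PPV of 0.97, critical lower bound of 0.90.
    n_cases = min_charts_to_review(0.97, 0.90)
    print("charts to review among algorithm-labelled cases:", n_cases)
    print("lower bound for 97/100 confirmed:", wilson_lower_bound(97, 100))
    print("sens, spec:", sensitivity_specificity(0.97, 0.97, 0.5))
```

In an iterative development/validation cycle of the kind the abstract describes, a check like `wilson_lower_bound(...) >= critical_lower_bound` for both PPV and NPV could serve as the stopping condition; if either bound falls short, the algorithm is revised and a fresh sample is drawn.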
Pages: 8