A strategy for validation of variables derived from large-scale electronic health record data

Cited by: 14
Authors
Liu, Lin [1, 2]
Bustamante, Ranier [2]
Earles, Ashley [3]
Demb, Joshua [2]
Messer, Karen [2]
Gupta, Samir [1, 2]
Affiliations
[1] VA San Diego Healthcare Syst, 3500 La Jolla Village Dr, San Diego, CA 92161 USA
[2] Univ Calif San Diego, 9500 Gilman Dr, La Jolla, CA 92093 USA
[3] Vet Med Res Fdn, 3350 La Jolla Village Dr, San Diego, CA 92161 USA
Funding
National Institutes of Health (USA)
Keywords
Electronic phenotyping; Large-scale electronic health records; Data abstraction validation; Sample size; Positive predictive value; Negative predictive value; IDENTIFY PATIENTS; SAMPLE-SIZE; CODING ALGORITHM; DISEASE; ASTHMA;
DOI
10.1016/j.jbi.2021.103879
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Purpose: Standardized approaches for rigorous validation of phenotyping from large-scale electronic health record (EHR) data have not been widely reported. We propose a methodologically rigorous and efficient approach to guide such validation, hereafter referred to as the San Diego Approach to Variable Validation (SDAVV). It includes strategies for sampling cases and controls, determining sample sizes, estimating algorithm performance, and terminating the validation process.
Methods: We propose sample size formulae to be applied before chart review, based on pre-specified critical lower bounds for positive predictive value (PPV) and negative predictive value (NPV). We also propose a stepwise strategy for iterative algorithm development and validation cycles, in which sample sizes for data abstraction are updated until both PPV and NPV achieve target performance.
Results: We applied the SDAVV to a Department of Veterans Affairs study in which we created two phenotyping algorithms, one for distinguishing normal colonoscopy cases from abnormal colonoscopy controls and one for identifying aspirin exposure. For identifying normal colonoscopy cases, estimated PPV and NPV both reached 0.970 (95% confidence lower bound of 0.915), with an estimated sensitivity of 0.963 and specificity of 0.975. The phenotyping algorithm for identifying aspirin exposure reached a PPV of 0.990 (95% lower bound of 0.950) and an NPV of 0.980 (95% lower bound of 0.930), with sensitivity and specificity of 0.960 and 1.000, respectively.
Conclusions: A structured approach for prospectively developing and validating phenotyping algorithms from large-scale EHR data can be successfully implemented, and should be considered to improve the quality of "big data" research.
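The abstract does not reproduce the sample size formulae, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an exact (Clopper-Pearson) one-sided lower confidence bound for a proportion such as PPV or NPV, and a simple search for the smallest chart-review sample whose lower bound would clear a pre-specified critical value if the observed proportion matched an anticipated one. The function names `exact_lower_bound` and `min_sample_size`, and the inputs used in the usage lines, are hypothetical.

```python
from math import comb


def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def exact_lower_bound(x, n, alpha=0.05):
    """One-sided exact (Clopper-Pearson) lower bound for a proportion x/n.

    Assumption: this is a generic textbook interval, not necessarily the
    interval used in the paper.
    """
    if x == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    # The upper tail is increasing in p, so bisect for tail(p) == alpha.
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_upper_tail(x, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo


def min_sample_size(expected, critical, alpha=0.05, max_n=2000):
    """Smallest n whose lower bound reaches `critical` when the observed
    proportion is (approximately) `expected`. Hypothetical helper."""
    for n in range(1, max_n + 1):
        x = round(expected * n)  # anticipated number of confirmed charts
        if exact_lower_bound(x, n, alpha) >= critical:
            return n
    return None


# Hypothetical inputs: anticipated PPV 0.95, critical lower bound 0.85.
n_charts = min_sample_size(0.95, 0.85)
print(n_charts, exact_lower_bound(round(0.95 * n_charts), n_charts))
```

In an iterative development and validation cycle of the kind the abstract describes, a search like this could be rerun with updated anticipated values after each round of chart review. The stopping rule itself is not specified here and would need to come from the full paper.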
Pages: 8
Source
Journal of Biomedical Informatics, 2021, 121