A strategy for validation of variables derived from large-scale electronic health record data

Cited by: 14
Authors
Liu, Lin [1, 2]
Bustamante, Ranier [2 ]
Earles, Ashley [3 ]
Demb, Joshua [2 ]
Messer, Karen [2 ]
Gupta, Samir [1, 2]
Affiliations
[1] VA San Diego Healthcare Syst, 3500 La Jolla Village Dr, San Diego, CA 92161 USA
[2] Univ Calif San Diego, 9500 Gilman Dr, La Jolla, CA 92093 USA
[3] Vet Med Res Fdn, 3350 La Jolla Village Dr, San Diego, CA 92161 USA
Funding
National Institutes of Health (USA);
Keywords
Electronic phenotyping; Large-scale electronic health records; Data abstraction validation; Sample size; Positive predictive value; Negative predictive value; IDENTIFY PATIENTS; SAMPLE-SIZE; CODING ALGORITHM; DISEASE; ASTHMA;
DOI
10.1016/j.jbi.2021.103879
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Purpose: Standardized approaches for rigorous validation of phenotyping from large-scale electronic health record (EHR) data have not been widely reported. We propose a methodologically rigorous and efficient approach to guide such validation, the San Diego Approach to Variable Validation (SDAVV), which includes strategies for sampling cases and controls, determining sample sizes, estimating algorithm performance, and terminating the validation process. Methods: We propose sample size formulae, to be applied before chart review, based on pre-specified critical lower bounds for positive predictive value (PPV) and negative predictive value (NPV). We also propose a stepwise strategy for iterative algorithm development/validation cycles, in which sample sizes for data abstraction are updated until both PPV and NPV achieve target performance. Results: We applied the SDAVV to a Department of Veterans Affairs study in which we created two phenotyping algorithms, one for distinguishing normal colonoscopy cases from abnormal colonoscopy controls and one for identifying aspirin exposure. For identifying normal colonoscopy cases, estimated PPV and NPV both reached 0.970 (95% confidence lower bound of 0.915), with an estimated sensitivity of 0.963 and specificity of 0.975. The algorithm for identifying aspirin exposure reached a PPV of 0.990 (95% lower bound of 0.950) and an NPV of 0.980 (95% lower bound of 0.930), with sensitivity and specificity of 0.960 and 1.000, respectively. Conclusions: A structured approach for prospectively developing and validating phenotyping algorithms from large-scale EHR data can be successfully implemented and should be considered to improve the quality of "big data" research.
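The abstract does not reproduce the SDAVV sample size formulae, so the following Python sketch only illustrates the general idea of planning a chart review around a pre-specified critical lower bound for PPV or NPV. It swaps in a one-sided Wilson score bound, which is an assumption and not the authors' method; the function names `wilson_lower_bound` and `min_sample_size` and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(successes, n, confidence=0.95):
    """One-sided Wilson score lower bound for a binomial proportion.

    Assumption: the Wilson interval stands in for whatever interval
    the SDAVV paper actually uses; the abstract does not say.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(confidence)  # one-sided critical value
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, (centre - margin) / denom)


def min_sample_size(expected, critical_lower, confidence=0.95, max_n=100_000):
    """Smallest number of charts n such that, if the observed PPV (or NPV)
    equals the anticipated value, its lower bound exceeds the critical bound.

    Hypothetical planning rule for illustration only.
    """
    if not 0 < critical_lower < expected <= 1:
        raise ValueError("need 0 < critical_lower < expected <= 1")
    for n in range(1, max_n + 1):
        # Round the anticipated number of confirmed charts down (conservative).
        successes = math.floor(expected * n + 1e-9)
        if wilson_lower_bound(successes, n, confidence) > critical_lower:
            return n
    raise ValueError("no sample size up to max_n meets the target")


if __name__ == "__main__":
    # Hypothetical planning inputs: anticipated PPV 0.95, critical bound 0.90.
    n = min_sample_size(expected=0.95, critical_lower=0.90)
    print("charts to review:", n)
    # Hypothetical review result: 97 of 100 algorithm-positive charts confirmed.
    print("PPV lower bound:", round(wilson_lower_bound(97, 100), 3))
```

In an iterative development/validation cycle of the kind the abstract describes, a check like this would be repeated after each round of chart review, stopping once the lower bounds for both PPV and NPV clear their targets.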
Pages: 8