Cohort design and natural language processing to reduce bias in electronic health records research

Authors
Shaan Khurshid
Christopher Reeder
Lia X. Harrington
Pulkit Singh
Gopal Sarma
Samuel F. Friedman
Paolo Di Achille
Nathaniel Diamant
Jonathan W. Cunningham
Ashby C. Turner
Emily S. Lau
Julian S. Haimovich
Mostafa A. Al-Alusi
Xin Wang
Marcus D. R. Klarqvist
Jeffrey M. Ashburner
Christian Diedrich
Mercedeh Ghadessi
Johanna Mielke
Hanna M. Eilken
Alice McElhinney
Andrea Derix
Steven J. Atlas
Patrick T. Ellinor
Anthony A. Philippakis
Christopher D. Anderson
Jennifer E. Ho
Puneet Batra
Steven A. Lubitz
Affiliations
[1] Massachusetts General Hospital, Division of Cardiology
[2] Massachusetts General Hospital, Cardiovascular Research Center
[3] Broad Institute of Harvard and the Massachusetts Institute of Technology, Cardiovascular Disease Initiative
[4] Broad Institute of Harvard and the Massachusetts Institute of Technology, Data Sciences Platform
[5] Brigham and Women’s Hospital, Division of Cardiology
[6] Massachusetts General Hospital, Department of Neurology
[7] Massachusetts General Hospital, Henry and Allison McCance Center for Brain Health
[8] Massachusetts General Hospital, Department of Medicine
[9] Harvard Medical School, Division of General Internal Medicine
[10] Bayer AG, Research and Development, Pharmaceuticals
[11] Massachusetts General Hospital, Demoulas Center for Cardiac Arrhythmias
[12] Massachusetts General Hospital, Eric and Wendy Schmidt Center
[13] Broad Institute of Harvard and the Massachusetts Institute of Technology, Center for Genomic Medicine
[14] Massachusetts General Hospital, Department of Neurology
[15] Brigham and Women’s Hospital
Abstract
Electronic health record (EHR) datasets are statistically powerful but subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients who received longitudinal primary care between 2001 and 2018 (Community Care Cohort Project [C3PO], n = 520,868). We used natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO with Convenience Samples comprising all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95–0.99). The incidence of atrial fibrillation and of myocardial infarction/stroke was lower, and risk models were better calibrated, in C3PO than in the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012–0.030 in C3PO vs. 0.028–0.046 in Convenience Samples; calibration error for atrial fibrillation: 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research.
Related papers
  • [1] Cohort design and natural language processing to reduce bias in electronic health records research
    Khurshid, Shaan
    Reeder, Christopher
    Harrington, Lia X.
    Singh, Pulkit
    Sarma, Gopal
    Friedman, Samuel F.
    Di Achille, Paolo
    Diamant, Nathaniel
    Cunningham, Jonathan W.
    Turner, Ashby C.
    Lau, Emily S.
    Haimovich, Julian S.
    Al-Alusi, Mostafa A.
    Wang, Xin
    Klarqvist, Marcus D. R.
    Ashburner, Jeffrey M.
    Diedrich, Christian
    Ghadessi, Mercedeh
    Mielke, Johanna
    Eilken, Hanna M.
    McElhinney, Alice
    Derix, Andrea
    Atlas, Steven J.
    Ellinor, Patrick T.
    Philippakis, Anthony A.
    Anderson, Christopher D.
    Ho, Jennifer E.
    Batra, Puneet
    Lubitz, Steven A.
    [J]. NPJ DIGITAL MEDICINE, 2022, 5 (01)
  • [2] Using Natural Language Processing to Predict Risk in Electronic Health Records
    Duy Van Le
    Montgomery, James
    Kirkby, Kenneth
    Scanlan, Joel
    [J]. MEDINFO 2023 - THE FUTURE IS ACCESSIBLE, 2024, 310 : 574 - 578
  • [3] Evaluation of a Natural Language Processing Approach to Identify Social Determinants of Health in Electronic Health Records in a Diverse Community Cohort
    Rouillard, Christopher J.
    Nasser, Mahmoud A.
    Hu, Haihong
    Roblin, Douglas W.
    [J]. MEDICAL CARE, 2022, 60 (03) : 248 - 255
  • [4] Applying Natural Language Processing Toolkits to Electronic Health Records - An Experience Report
    Barrett, Neil
    Weber-Jahnke, Jens H.
    [J]. ADVANCES IN INFORMATION TECHNOLOGY AND COMMUNICATION IN HEALTH, 2009, 143 : 441 - 446
  • [5] Natural Language Processing to Identify Lupus Nephritis Phenotype in Electronic Health Records
    Deng, Yu
    Pacheco, Jennifer
    Chung, Anh
    Mao, Chengsheng
    Smith, Joshua
    Zhao, Juan
    Wei, Wei-Qi
    Barnado, April
    Weng, Chunhua
    Liu, Cong
    Gordon, Adam
    Yu, Jingzhi
    Tedla, Yacob
    Kho, Abel
    Ramsey-Goldman, Rosalind
    Walunas, Theresa
    Luo, Yuan
    [J]. ARTHRITIS & RHEUMATOLOGY, 2021, 73 : 666 - 667
  • [6] Natural language processing to identify lupus nephritis phenotype in electronic health records
    Yu Deng
    Jennifer A. Pacheco
    Anika Ghosh
    Anh Chung
    Chengsheng Mao
    Joshua C. Smith
    Juan Zhao
    Wei-Qi Wei
    April Barnado
    Chad Dorn
    Chunhua Weng
    Cong Liu
    Adam Cordon
    Jingzhi Yu
    Yacob Tedla
    Abel Kho
    Rosalind Ramsey-Goldman
    Theresa Walunas
    Yuan Luo
    [J]. BMC Medical Informatics and Decision Making, 22
  • [8] Natural language processing to identify lupus nephritis phenotype in electronic health records
    Deng, Yu
    Pacheco, Jennifer A.
    Ghosh, Anika
    Chung, Anh
    Mao, Chengsheng
    Smith, Joshua C.
    Zhao, Juan
    Wei, Wei-Qi
    Barnado, April
    Dorn, Chad
    Weng, Chunhua
    Liu, Cong
    Cordon, Adam
    Yu, Jingzhi
    Tedla, Yacob
    Kho, Abel
    Ramsey-Goldman, Rosalind
    Walunas, Theresa
    Luo, Yuan
    [J]. BMC MEDICAL INFORMATICS AND DECISION MAKING, 2024, 22 (SUPPL 2)
  • [9] Neural Natural Language Processing for unstructured data in electronic health records: A review
    Li, Irene
    Pan, Jessica
    Goldwasser, Jeremy
    Verma, Neha
    Wong, Wai Pan
    Nuzumlali, Muhammed Yavuz
    Rosand, Benjamin
    Li, Yixin
    Zhang, Matthew
    Chang, David
    Taylor, R. Andrew
    Krumholz, Harlan M.
    Radev, Dragomir
    [J]. COMPUTER SCIENCE REVIEW, 2022, 46
  • [10] Natural language generation for electronic health records
    Scott H. Lee
    [J]. npj Digital Medicine, 1