Comparing natural language processing representations of coded disease sequences for prediction in electronic health records

Cited by: 2
Authors
Beaney, Thomas [1 ,2 ,4 ]
Jha, Sneha [2 ]
Alaa, Asem [2 ]
Smith, Alexander [3 ]
Clarke, Jonathan [2 ]
Woodcock, Thomas [1 ]
Majeed, Azeem [1 ]
Aylin, Paul [1 ]
Barahona, Mauricio [2 ]
Affiliations
[1] Imperial Coll London, Dept Primary Care & Publ Hlth, London W12 0BZ, England
[2] Imperial Coll London, Ctr Math Precis Healthcare, Dept Math, London SW7 2AZ, England
[3] Imperial Coll London, Dept Epidemiol & Biostat, London W2 1PG, England
[4] Imperial Coll London, Sch Publ Hlth, Dept Primary Care & Publ Hlth, 90 Wood Lane, London W12 0BZ, England
Keywords
Multiple Long Term Conditions; Long-Term Conditions; diseases; representations; prediction; natural language processing; MULTIMORBIDITY; MAP;
DOI
10.1093/jamia/ocae091
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Objective: Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes.

Materials and Methods: This cohort study used primary care EHRs from 6 286 233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis.

Results: Patient representations generated by sequence-based algorithms performed consistently better than bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories perform similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes.

Discussion and Conclusion: Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.
Pages: 1451-1462
Page count: 12
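The abstract contrasts bag-of-words representations with sequence-based ones. The toy sketch below illustrates only that distinction and is not the study's method: the disease names, the two patient histories, and the helper functions are hypothetical, and ordered code pairs stand in for the far richer sequence models the paper evaluates (doc2vec and transformers).

```python
from collections import Counter


def bag_of_words(sequence, vocabulary):
    """Count how often each code occurs; the order of diagnoses is discarded."""
    counts = Counter(sequence)
    return [counts[code] for code in vocabulary]


def ordered_pairs(sequence, vocabulary):
    """Count adjacent (earlier, later) code pairs; order is partly preserved.

    This is a simple stand-in for a sequence-aware representation,
    not one of the algorithms compared in the paper.
    """
    pairs = Counter(zip(sequence, sequence[1:]))
    return [pairs[(a, b)] for a in vocabulary for b in vocabulary]


# Hypothetical vocabulary of disease categories (the study used 212).
vocabulary = ["diabetes", "hypertension", "ckd"]

# Two hypothetical patients with the same diseases in a different order.
patient_a = ["hypertension", "diabetes", "ckd"]
patient_b = ["ckd", "diabetes", "hypertension"]

bow_a = bag_of_words(patient_a, vocabulary)
bow_b = bag_of_words(patient_b, vocabulary)
seq_a = ordered_pairs(patient_a, vocabulary)
seq_b = ordered_pairs(patient_b, vocabulary)

print("Bag-of-words vectors identical:", bow_a == bow_b)
print("Order-aware vectors identical:", seq_a == seq_b)
```

Under these assumptions the two patients are indistinguishable to the bag-of-words encoding but not to the order-aware one. In the study, vectors of either kind would then be passed to a logistic classifier to predict each outcome.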