Comparing natural language processing representations of coded disease sequences for prediction in electronic health records

Cited by: 2
Authors
Beaney, Thomas [1, 2, 4]
Jha, Sneha [2]
Alaa, Asem [2]
Smith, Alexander [3]
Clarke, Jonathan [2]
Woodcock, Thomas [1]
Majeed, Azeem [1]
Aylin, Paul [1]
Barahona, Mauricio [2]
Affiliations
[1] Imperial Coll London, Dept Primary Care & Publ Hlth, London W12 0BZ, England
[2] Imperial Coll London, Ctr Math Precis Healthcare, Dept Math, London SW7 2AZ, England
[3] Imperial Coll London, Dept Epidemiol & Biostat, London W2 1PG, England
[4] Imperial Coll London, Sch Publ Hlth, Dept Primary Care & Publ Hlth, 90 Wood Lane, London W12 0BZ, England
Keywords
Multiple Long Term Conditions; Long-Term Conditions; diseases; representations; prediction; natural language processing; MULTIMORBIDITY; MAP;
DOI
10.1093/jamia/ocae091
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes.

Materials and Methods: This cohort study used primary care EHRs from 6 286 233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis.

Results: Patient representations generated by sequence-based algorithms performed consistently better than bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories perform similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes.

Discussion and Conclusion: Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.
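
The distinction the abstract draws between bag-of-words and sequence-based representations can be illustrated with a minimal Python sketch. It is not the study's method: the three-code vocabulary, the two toy patients, and the helper names `bag_of_words` and `ordered_bigrams` are all hypothetical, and adjacent-pair counts are only a crude stand-in for the order-awareness of doc2vec or transformer models.

```python
from collections import Counter

# Hypothetical toy vocabulary of disease categories (not taken from the study).
VOCAB = ["diabetes", "hypertension", "ckd"]


def bag_of_words(sequence, vocab=VOCAB):
    """Count how often each code occurs; the order of diagnosis is discarded."""
    counts = Counter(sequence)
    return [counts[code] for code in vocab]


def ordered_bigrams(sequence):
    """Count adjacent pairs of codes, so the order of diagnosis is retained."""
    return Counter(zip(sequence, sequence[1:]))


# Two illustrative patients with the same conditions in reverse order.
patient_a = ["hypertension", "diabetes", "ckd"]
patient_b = ["ckd", "diabetes", "hypertension"]

same_bow = bag_of_words(patient_a) == bag_of_words(patient_b)
same_bigrams = ordered_bigrams(patient_a) == ordered_bigrams(patient_b)
print(same_bow, same_bigrams)  # prints: True False
```

The two patients are indistinguishable under the count-based representation but differ once order is encoded, which is the kind of information a downstream classifier can only use if the representation preserves it.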
Pages: 1451-1462
Number of pages: 12