Comparing natural language processing representations of coded disease sequences for prediction in electronic health records

Cited by: 2
Authors
Beaney, Thomas [1 ,2 ,4 ]
Jha, Sneha [2 ]
Alaa, Asem [2 ]
Smith, Alexander [3 ]
Clarke, Jonathan [2 ]
Woodcock, Thomas [1 ]
Majeed, Azeem [1 ]
Aylin, Paul [1 ]
Barahona, Mauricio [2 ]
Affiliations
[1] Imperial Coll London, Dept Primary Care & Publ Hlth, London W12 0BZ, England
[2] Imperial Coll London, Ctr Math Precis Healthcare, Dept Math, London SW7 2AZ, England
[3] Imperial Coll London, Dept Epidemiol & Biostat, London W2 1PG, England
[4] Imperial Coll London, Sch Publ Hlth, Dept Primary Care & Publ Hlth, 90 Wood Lane, London W12 0BZ, England
Keywords
Multiple Long Term Conditions; Long-Term Conditions; diseases; representations; prediction; natural language processing; MULTIMORBIDITY; MAP;
DOI
10.1093/jamia/ocae091
CLC number: TP [Automation & Computer Technology]
Discipline code: 0812
Abstract
Objective: Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes.
Materials and Methods: This cohort study used primary care EHRs from 6 286 233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis.
Results: Patient representations generated by sequence-based algorithms consistently outperformed bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories performed similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes.
Discussion and Conclusion: Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.
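To make the evaluation setup concrete, the following is a minimal illustrative sketch (not the authors' code) of the bag-of-words arm of the comparison: disease-code sequences are reduced to counts, Latent Dirichlet Allocation produces an unsupervised per-patient topic-proportion vector, and that frozen representation is fed to a logistic classifier. The disease codes and outcome labels below are synthetic, hypothetical examples.

```python
# Illustrative sketch: LDA topic proportions over disease codes as an
# unsupervised patient representation, evaluated with a logistic classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy patients: each is a space-joined sequence of (hypothetical) disease codes.
patients = [
    "htn dm2 ckd htn",
    "asthma copd asthma",
    "dm2 ckd ckd htn",
    "asthma eczema",
    "htn ckd dm2",
    "copd asthma copd",
]
outcome = [1, 0, 1, 0, 1, 0]  # synthetic binary endpoint labels

# Bag-of-words: the order of codes is discarded; only counts remain,
# which is precisely what sequence-based models additionally exploit.
counts = CountVectorizer().fit_transform(patients)

# Unsupervised representation: per-patient topic proportions from LDA.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
reps = lda.fit_transform(counts)  # shape: (n_patients, n_topics)

# Frozen representation -> simple supervised head, mirroring the paper's
# evaluation of representations without fine-tuning.
clf = LogisticRegression().fit(reps, outcome)
print(reps.shape, clf.score(reps, outcome))
```

Swapping `reps` for doc2vec or transformer-derived vectors while keeping the same logistic head is what allows the representations to be compared on equal footing.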
Pages: 1451-1462 (12 pages)