Using ICD-9 diagnostic codes for external validation of topic models derived from primary care electronic medical record clinical text data

Authors
Meaney, Christopher [1,4]
Escobar, Michael [1]
Stukel, Therese A. [1,2]
Austin, Peter C. [1]
Kalia, Sumeet [1]
Aliarzadeh, Babak [1]
Moineddin, Rahim [1]
Greiver, Michelle [1,3]
Affiliations
[1] Univ Toronto, Toronto, ON, Canada
[2] ICES, Toronto, ON, Canada
[3] Univ Toronto, North York Gen Hosp, Toronto, ON, Canada
[4] Univ Toronto, Dept Family & Community Med, 500 Univ Ave Su 346, Toronto, ON M5G 1V7, Canada
Funding
Canadian Institutes of Health Research
Keywords
non-negative matrix factorization; topic model; external validation; concurrent validity; convergent validity; discriminant validity; clinical text data; ICD-9 codes; electronic medical record; algorithms
DOI
10.1177/14604582221115667
Chinese Library Classification (CLC) number
R19 [Health care organizations and services (health services administration)]
Abstract
Background/Objectives: Unsupervised topic models are often used to improve understanding of large unstructured clinical text datasets. In this study, we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish the concurrent, convergent, and discriminant validity of learned topic models.
Design/Setting: Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada, from 01/01/2017 through 12/31/2020.
Methods: We fit a non-negative matrix factorization topic model, with K = 50 latent topics/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We then identified the ICD-9 diagnostic codes most strongly associated with each latent topical vector and qualitatively interpreted how these codes could be used for external validation of the learned topic model.
Results: The DTM consisted of 382,666 documents and 2,210 words/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors and observed semantic agreement for a subset of latent constructs (e.g., conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye/ear/nose/throat conditions, conditions of the urinary system, and dermatological conditions).
Conclusions: When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models.
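The association step described in the Methods can be illustrated with a minimal sketch. The abstract does not state which association measure was used, so the example below assumes a point-biserial (Pearson) correlation between each Boolean code indicator and each topic's document weights, and ranks codes by that value. The function names, the toy 6 x 2 topic-weight matrix (standing in for the document-topic matrix of a fitted NMF model), and the three example codes are all hypothetical illustrations, not the authors' data or implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; when x is 0/1 this equals the point-biserial correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 in this sketch
    return cov / sqrt(var_x * var_y)


def top_codes_per_topic(topic_weights, code_flags, n_top=1):
    """For each topic column, return the n_top codes with the largest positive correlation."""
    n_topics = len(topic_weights[0])
    ranked = []
    for k in range(n_topics):
        weights_k = [row[k] for row in topic_weights]
        scores = [(code, pearson(flags, weights_k)) for code, flags in code_flags.items()]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        ranked.append(scores[:n_top])
    return ranked


# Hypothetical toy data: 6 documents x 2 topics (rows of a document-topic weight matrix).
topic_weights = [
    [0.9, 0.0],
    [0.8, 0.1],
    [0.7, 0.0],
    [0.0, 0.9],
    [0.1, 0.8],
    [0.0, 0.7],
]

# Boolean indicators: was the code assigned alongside each document? (illustrative values)
code_flags = {
    "466": [1, 1, 1, 0, 0, 0],  # acute bronchitis and bronchiolitis
    "692": [0, 0, 0, 1, 1, 1],  # contact dermatitis and other eczema
    "401": [1, 0, 1, 0, 1, 0],  # essential hypertension
}

top = top_codes_per_topic(topic_weights, code_flags)
for k, best in enumerate(top):
    print(f"topic {k}: {best}")
```

In this toy setup, a topic whose top-ranked code matches the topic's apparent theme would count as evidence of convergent validity, while weak or negative correlations with unrelated codes would speak to discriminant validity. Other association measures could be substituted for `pearson` without changing the ranking logic.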
Pages: 28