Using ICD-9 diagnostic codes for external validation of topic models derived from primary care electronic medical record clinical text data
Cited by: 0
Authors:
Meaney, Christopher [1,4]
Escobar, Michael [1]
Stukel, Therese A. [1,2]
Austin, Peter C. [1]
Kalia, Sumeet [1]
Aliarzadeh, Babak [1]
Moineddin, Rahim [1]
Greiver, Michelle [1,3]
Affiliations:
[1] Univ Toronto, Toronto, ON, Canada
[2] ICES, Toronto, ON, Canada
[3] Univ Toronto, North York Gen Hosp, Toronto, ON, Canada
[4] Univ Toronto, Dept Family & Community Med, 500 Univ Ave Su 346, Toronto, ON M5G 1V7, Canada
Funding:
Canadian Institutes of Health Research;
Keywords:
non-negative matrix factorization;
topic model;
external validation;
concurrent validity;
convergent validity;
discriminant validity;
clinical text data;
ICD-9;
codes;
electronic medical record;
ALGORITHMS;
DOI:
10.1177/14604582221115667
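Chinese Library Classification (CLC) number:
R19 [Health care organization and services (health services administration)];
Subject classification code:
Abstract: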
Background/Objectives: Unsupervised topic models are often used to improve understanding of large unstructured clinical text datasets. In this study we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish the concurrent, convergent, and discriminant validity of learned topic models. Design/Setting: Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada from 01/01/2017 through 12/31/2020. Methods: We fit a non-negative matrix factorization topic model, with K = 50 latent topics/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We identified the ICD-9 diagnostic codes most strongly associated with each latent topical vector, and qualitatively interpreted how these codes could be used for external validation of the learned topic model. Results: The DTM consisted of 382,666 documents and 2,210 words/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors, and observed semantic agreement for a subset of latent constructs (e.g. conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye/ear/nose/throat conditions, conditions of the urinary system, and dermatological conditions). Conclusions: When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models.
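The association step described in the Methods can be illustrated with a minimal sketch. The abstract does not say which association measure the authors used; the sketch below assumes a point-biserial (Pearson) correlation between each 0/1 code indicator and each topic's per-document weight, purely for illustration. The matrix `W` stands in for the document-by-topic weights that an NMF fit would produce, and the function names, the labels `code_A` and `code_B`, and all numbers are hypothetical toy values, not data or results from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; when x is 0/1 this equals the point-biserial correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation is undefined, report 0 here
    return sxy / sqrt(sxx * syy)

def top_codes_per_topic(W, codes, top_n=1):
    """Rank codes by correlation with each topic.

    W     : list of rows, one per document, each a list of topic weights.
    codes : dict mapping a code label to a list of 0/1 flags, one per document.
    Returns {topic_index: [(code_label, correlation), ...]} with the
    top_n most positively correlated codes for each topic.
    """
    n_topics = len(W[0])
    result = {}
    for k in range(n_topics):
        topic_k = [row[k] for row in W]  # weights of topic k across documents
        scored = [(pearson(flags, topic_k), code) for code, flags in codes.items()]
        scored.sort(reverse=True)  # strongest positive association first
        result[k] = [(code, round(r, 3)) for r, code in scored[:top_n]]
    return result

# Toy data (assumed for illustration): 6 documents, 2 topics, 2 codes.
W = [
    [0.9, 0.0],
    [0.8, 0.1],
    [0.7, 0.0],
    [0.1, 0.9],
    [0.0, 0.8],
    [0.2, 0.7],
]
codes = {
    "code_A": [1, 1, 1, 0, 0, 0],
    "code_B": [0, 0, 0, 1, 1, 1],
}
top = top_codes_per_topic(W, codes)
```

In this toy setup each topic is paired with the code whose assignment pattern tracks its weights. The qualitative step described in the abstract would then ask whether the top-ranked codes and the topic's top words describe the same clinical construct.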
Pages: 28