Using machine learning to link electronic health records in cancer registries: On the tradeoff between linkage quality and manual effort

Authors
Rochner, Philipp [1, 2]
Rothlauf, Franz [2 ]
Affiliations
[1] Inst Digital Hlth Data Rhineland Palatinate, Canc Registry, Grosse Ble 46, D-55116 Mainz, Germany
[2] Johannes Gutenberg Univ Mainz, Informat Syst & Business Adm, Jakob Welder Weg 9, D-55128 Mainz, Germany
Keywords
Record linkage; Data matching; Cancer registry; Electronic health records; Machine learning; Data quality
DOI
10.1016/j.ijmedinf.2024.105387
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Background: Cancer registries link a large number of electronic health records reported by medical institutions to already registered records of the matching individual and tumor. Records are linked automatically using deterministic and probabilistic approaches; machine learning is rarely used. Records that cannot be matched automatically with sufficient accuracy are typically processed manually. In practice, it is important to know how well record linkage approaches match real-world records and how much manual effort is required to achieve the desired linkage quality. We study the task of linking reported records to the matching registered tumor in cancer registries.
Methods: We compare the tradeoff between linkage quality and manual effort of five machine learning methods (logistic regression, random forest, gradient boosting, neural network, and a stacked method) to a deterministic baseline. The record linkage methods are compared in a two-class setting (no-match/match) and a three-class setting (no-match/undecided/match). The dataset, collected and linked by a cancer registry, consists of categorical variables and matches 145,755 reported records with 33,289 registered tumors.
Results: In the two-class setting, the gradient boosting, neural network, and stacked models have higher accuracy and F1 score (accuracy: 0.968-0.978, F1 score: 0.983-0.988) than the deterministic baseline (accuracy: 0.964, F1 score: 0.980) when the same records are manually processed (0.89% of all records). In the three-class setting, these three machine learning methods can automatically process all reported records and still have higher accuracy and F1 score than the deterministic baseline. The linkage quality of the machine learning methods studied, except for the neural network, increases as the number of manually processed records increases.
Conclusion: Machine learning methods can significantly improve linkage quality and reduce the manual effort required by medical coders to match tumor records in cancer registries compared to a deterministic baseline. Our results help cancer registries estimate how linkage quality increases as more records are manually processed.
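The three-class setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two thresholds `lower` and `upper`, the toy scores, and the assumption that manual review is always correct are all hypothetical and chosen only to show how widening the "undecided" band trades manual effort for linkage quality.

```python
from typing import List, Tuple


def three_class_decision(score: float, lower: float, upper: float) -> str:
    """Map a match score in [0, 1] to no-match / undecided / match.

    Hypothetical rule: scores at or above `upper` are matches, scores at or
    below `lower` are no-matches, everything in between goes to manual review.
    Setting lower == upper yields the two-class setting.
    """
    if score >= upper:
        return "match"
    if score <= lower:
        return "no-match"
    return "undecided"


def evaluate(
    scores: List[float], labels: List[bool], lower: float, upper: float
) -> Tuple[float, float]:
    """Return (share of records sent to manual review, overall accuracy).

    Assumption (not from the source): manually reviewed records are
    always resolved correctly.
    """
    manual = 0
    correct = 0
    for score, is_match in zip(scores, labels):
        decision = three_class_decision(score, lower, upper)
        if decision == "undecided":
            manual += 1
            correct += 1  # assumed perfect manual review
        elif (decision == "match") == is_match:
            correct += 1
    n = len(scores)
    return manual / n, correct / n


# Toy data: hypothetical classifier scores and true match labels.
scores = [0.05, 0.20, 0.45, 0.55, 0.80, 0.95]
labels = [False, False, True, False, True, True]

two_class = evaluate(scores, labels, lower=0.5, upper=0.5)
three_class = evaluate(scores, labels, lower=0.3, upper=0.7)
print("two-class   (manual share, accuracy):", two_class)
print("three-class (manual share, accuracy):", three_class)
```

Sweeping `lower` and `upper` over a grid and plotting accuracy against manual share would give a tradeoff curve of the kind the abstract refers to, under the same assumptions.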
Pages: 11