ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis

Cited by: 0
Authors
Gan, Ziming [1]
Zhou, Doudou [2]
Rush, Everett [3]
Panickan, Vidul A. [4, 5]
Ho, Yuk-Lam [5]
Ostrouchov, George [3]
Xu, Zhiwei [6]
Shen, Shuting [7]
Xiong, Xin [8]
Greco, Kimberly F. [8]
Hong, Chuan [7]
Bonzel, Clara-Lea [4]
Wen, Jun [4]
Costa, Lauren [5]
Cai, Tianrun [5, 9]
Begoli, Edmon
Xia, Zongqi [10]
Gaziano, J. Michael [5, 9]
Liao, Katherine P. [5, 9]
Cho, Kelly [5, 9]
Cai, Tianxi [4, 5, 8]
Lu, Junwei [5, 8]
Affiliations
[1] Univ Chicago, Dept Stat, 5801 S Ellis Ave, Chicago, IL 60615 USA
[2] Natl Univ Singapore, Dept Stat & Data Sci, Singapore 117546, Singapore
[3] Oak Ridge Natl Lab, Bethel Valley Rd, Oak Ridge, TN 37830 USA
[4] Harvard Med Sch, 25 Shattuck St, Boston, MA 02115 USA
[5] VA Boston Healthcare Syst, 150 S Huntington Ave, Boston, MA 02130 USA
[6] Univ Michigan, Dept Stat, 500 S State St, Ann Arbor, MI 48109 USA
[7] Duke Univ, Dept Biostat & Bioinformat, 1121 West Main St, Durham, NC 27708 USA
[8] Harvard TH Chan Sch Publ Hlth, 677 Huntington Ave, Boston, MA 02115 USA
[9] Brigham & Womens Hosp, 60 Fenwood Rd, Boston, MA 02115 USA
[10] Univ Pittsburgh, Clin & Translat Sci, 3501 Fifth Ave, Pittsburgh, PA 15260 USA
Keywords
Electronic health records; Natural language processing; Representation learning; Knowledge graph; ALZHEIMER-DISEASE; IDENTIFY; MODERATE; RISK;
DOI
10.1016/j.jbi.2024.104761
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes (NLP). The complexity of EHR presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.
Methods: Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features to build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer's disease patients.
Results: ARCH produces high-quality clinical embeddings and KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in the R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with a 0.723 AUC, which improved to 0.826 after fine-tuning. Using both codified and NLP features, the detection power increased significantly. Compared to other methods, ARCH has superior accuracy and enhances weakly supervised phenotyping algorithms' performance. Notably, it successfully categorized Alzheimer's patients into two subgroups with varying mortality rates.
Conclusion: The proposed ARCH algorithm generates large-scale high-quality semantic representations and knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
Pages: 11