ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis

Authors
Gan, Ziming [1]
Zhou, Doudou [2]
Rush, Everett [3]
Panickan, Vidul A. [4, 5]
Ho, Yuk-Lam [5]
Ostrouchov, George [3]
Xu, Zhiwei [6]
Shen, Shuting [7]
Xiong, Xin [8]
Greco, Kimberly F. [8]
Hong, Chuan [7]
Bonzel, Clara-Lea [4]
Wen, Jun [4]
Costa, Lauren [5]
Cai, Tianrun [5, 9]
Begoli, Edmon
Xia, Zongqi [10]
Gaziano, J. Michael [5, 9]
Liao, Katherine P. [5, 9]
Cho, Kelly [5, 9]
Cai, Tianxi [4, 5, 8]
Lu, Junwei [5, 8]
Affiliations
[1] Univ Chicago, Dept Stat, 5801 S Ellis Ave, Chicago, IL 60615 USA
[2] Natl Univ Singapore, Dept Stat & Data Sci, Singapore 117546, Singapore
[3] Oak Ridge Natl Lab, Bethel Valley Rd, Oak Ridge, TN 37830 USA
[4] Harvard Med Sch, 25 Shattuck St, Boston, MA 02115 USA
[5] VA Boston Healthcare Syst, 150 S Huntington Ave, Boston, MA 02130 USA
[6] Univ Michigan, Dept Stat, 500 S State St, Ann Arbor, MI 48109 USA
[7] Duke Univ, Dept Biostat & Bioinformat, 1121 West Main St, Durham, NC 27708 USA
[8] Harvard TH Chan Sch Publ Hlth, 677 Huntington Ave, Boston, MA 02115 USA
[9] Brigham & Womens Hosp, 60 Fenwood Rd, Boston, MA 02115 USA
[10] Univ Pittsburgh, Clin & Translat Sci, 3501 Fifth Ave, Pittsburgh, PA 15260 USA
Keywords
Electronic health records; Natural language processing; Representation learning; Knowledge graph; Alzheimer disease; Identify; Moderate; Risk
DOI
10.1016/j.jbi.2024.104761
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored both as codified data and as free-text narrative notes, the latter analyzed via natural language processing (NLP). The complexity of EHR data presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we propose an efficient Aggregated naRrative Codified Health (ARCH) records analysis that generates a large-scale knowledge graph (KG) for a comprehensive set of codified and narrative EHR features.
Methods: Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities, along with associated p-values, to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features and build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, and sub-typing Alzheimer's disease patients.
Results: ARCH produces high-quality clinical embeddings and a KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in an R Shiny-powered web API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with an AUC of 0.723, which improved to 0.826 after fine-tuning. Detection power increased significantly when both codified and NLP features were used. Compared to other methods, ARCH has superior accuracy and enhances the performance of weakly supervised phenotyping algorithms. Notably, it successfully categorized Alzheimer's patients into two subgroups with varying mortality rates.
Conclusion: The proposed ARCH algorithm generates large-scale, high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, which are useful for a wide range of predictive modeling tasks.
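The two modeling steps named in the Methods (embedding-based similarity, then a sparse regression that prunes indirect links) can be illustrated with a minimal sketch. The abstract does not say how the embeddings are derived or how the regression is solved, so everything below is an assumption made for illustration: embeddings come from an eigendecomposition of a positive PMI matrix built from a toy co-occurrence table, similarity is cosine similarity, and the sparse step is a lasso solved by coordinate descent. The function names (ppmi, embed, cosine_similarity, lasso_cd, sparse_neighbors), the toy counts, and the penalty value are hypothetical and not taken from ARCH; the p-value computation is omitted.

```python
import numpy as np

def ppmi(cooc):
    """Positive pointwise mutual information of a symmetric co-occurrence matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute no association
    return np.maximum(pmi, 0.0)

def embed(cooc, dim):
    """Assumed embedding: top `dim` eigenpairs of the PPMI matrix."""
    vals, vecs = np.linalg.eigh(ppmi(cooc))   # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]        # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def cosine_similarity(emb):
    """Pairwise cosine similarity between embedding rows."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()
    for _ in range(n_iter):
        for k in range(X.shape[1]):
            if col_sq[k] == 0:
                continue
            resid += X[:, k] * beta[k]        # remove feature k's contribution
            rho = X[:, k] @ resid
            beta[k] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[k]
            resid -= X[:, k] * beta[k]        # add the updated contribution back
    return beta

def sparse_neighbors(emb, j, lam):
    """Regress feature j's embedding on all others; nonzero weights are kept edges."""
    others = [k for k in range(emb.shape[0]) if k != j]
    beta = lasso_cd(emb[others].T, emb[j], lam)
    return {k: b for k, b in zip(others, beta) if b != 0.0}

# Toy co-occurrence counts: features 0 and 1 co-occur often, as do 2 and 3.
cooc = np.array([[10, 8, 1, 1],
                 [8, 10, 1, 1],
                 [1, 1, 10, 8],
                 [1, 1, 8, 10]], dtype=float)

emb = embed(cooc, dim=2)
sim = cosine_similarity(emb)
neighbors = sparse_neighbors(emb, j=0, lam=0.05)
print(np.round(sim, 3))
print(neighbors)
```

On this toy table, feature 0 is highly similar to feature 1 and the sparse regression keeps only that link. The L1 penalty is what drives weak or indirect associations to exactly zero, which is the general idea behind building a sparse graph from dense similarities.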
Pages: 11