Machine learning prediction of incidence of Alzheimer's disease using large-scale administrative health data

Cited by: 54
Authors
Park, Ji Hwan [1]
Cho, Han Eol [2,3]
Kim, Jong Hun [4]
Wall, Melanie M. [5]
Stern, Yaakov [5,6]
Lim, Hyunsun [7]
Yoo, Shinjae [1]
Kim, Hyoung Seop [8]
Cha, Jiook [5,9,10,11]
Affiliations
[1] Brookhaven Natl Lab, Computat Sci Initiat, Upton, NY 11973 USA
[2] Yonsei Univ, Gangnam Severance Hosp, Dept Rehabil Med, Coll Med, Seoul, South Korea
[3] Yonsei Univ, Coll Med, Rehabil Inst Neuromuscular Dis, Seoul, South Korea
[4] Ilsan Hosp, Dementia Ctr, Dept Neurol, Natl Hlth Insurance Serv, Goyang, South Korea
[5] Columbia Univ, Vagelos Coll Phys & Surg, Dept Psychiat, New York, NY 10025 USA
[6] Columbia Univ, Vagelos Coll Phys & Surg, Dept Neurol, New York, NY 10025 USA
[7] Ilsan Hosp, Natl Hlth Insurance Serv, Res & Anal Team, Goyang, South Korea
[8] Ilsan Hosp, Dementia Ctr, Dept Phys Med & Rehabil, Natl Hlth Insurance Serv, Goyang, South Korea
[9] Seoul Natl Univ, Dept Psychol, Seoul, South Korea
[10] Seoul Natl Univ, Dept Brain & Cognit Sci, Seoul, South Korea
[11] Seoul Natl Univ, Grad Sch Data Sci, Seoul, South Korea
Funding
National Research Foundation, Singapore;
Keywords
DEMENTIA RISK; COGNITIVE DEFICITS; OLDER PERSONS; POPULATION; DYSFUNCTION; MODELS; ANEMIA; SAMPLE; COHORT;
DOI
10.1038/s41746-020-0256-0
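Chinese Library Classification number
R19 [Health organizations and services (health services administration)];
Subject classification code
Abstract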
Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models based on individuals' history of health and healthcare, going beyond existing risk prediction models. We tested whether machine learning models can predict the future incidence of Alzheimer's disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data for elders above 65 years (N = 40,736) containing 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, we considered two operational definitions: "definite AD", with diagnostic codes and dementia medication (n = 614), and "probable AD", with only a diagnosis (n = 2,026). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD over the subsequent 1, 2, 3, and 4 years. For predicting future incidence of AD in balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction, with AUCs of 0.775 and 0.759 based on the "definite AD" and "probable AD" outcomes, respectively; in 2-year prediction, 0.730 and 0.693; in 3-year, 0.677 and 0.644; in 4-year, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data in AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or early detection in clinical settings.
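The abstract reports performance as AUC on class-balanced samples. As a rough illustration of those two ideas only, the sketch below computes AUC through the pairwise (Mann-Whitney) identity and draws a balanced sample by undersampling the majority class. The function names, the undersampling scheme, and the synthetic scores are assumptions made for illustration; the abstract does not describe the study's actual resampling procedure or code.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Equals the fraction of positive/negative pairs in which the positive
    case has the higher score, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_sample(labels, rng):
    """Indices of a class-balanced sample (assumed scheme, not the paper's):
    every minority-class case plus an equal number of majority-class cases
    drawn at random without replacement."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((pos_idx, neg_idx), key=len)
    return minority + rng.sample(majority, len(minority))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic, hypothetical data: 5 incident cases among 100 people,
    # with risk scores shifted upward for the cases.
    labels = [1] * 5 + [0] * 95
    scores = [rng.random() + 0.5 * y for y in labels]
    idx = balanced_sample(labels, rng)
    print(len(idx), auc([scores[i] for i in idx], [labels[i] for i in idx]))
```

The pairwise form is quadratic in sample size, which is fine for a small illustration; a rank-based computation would be the usual choice for cohorts of the size described above.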
Pages: 7
Related papers
50 in total
  • [1] Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data
    Ji Hwan Park
    Han Eol Cho
    Jong Hun Kim
    Melanie M. Wall
    Yaakov Stern
    Hyunsun Lim
    Shinjae Yoo
    Hyoung Seop Kim
    Jiook Cha
    npj Digital Medicine, 3
  • [2] Alzheimer's Disease Risk Assessment Using Large-Scale Machine Learning Methods
    Casanova, Ramon
    Hsu, Fang-Chi
    Sink, Kaycee M.
    Rapp, Stephen R.
    Williamson, Jeff D.
    Resnick, Susan M.
    Espeland, Mark A.
    PLOS ONE, 2013, 8 (11):
  • [3] Machine learning based survival prediction in Glioma using large-scale registry data
    Zhao, Rachel
    Zhuge, Ying
    Camphausen, Kevin
    Krauze, Andra V.
    HEALTH INFORMATICS JOURNAL, 2022, 28 (04)
  • [4] Investigating Parkinson’s disease risk across farming activities using data mining and large-scale administrative health data
    Pascal Petit
    François Berger
    Vincent Bonneterre
    Nicolas Vuillerme
    npj Parkinson's Disease, 11 (1)
  • [5] Large-Scale Machine Learning for Business Sector Prediction
    Angenent, Mitch N.
    Barata, Antonio Pereira
    Takes, Frank W.
    PROCEEDINGS OF THE 35TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING (SAC'20), 2020, : 1143 - 1146
  • [6] Evaluation of Machine Learning Methods on Large-Scale Spatiotemporal Data for Photovoltaic Power Prediction
    Sauter, Evan
    Mughal, Maqsood
    Zhang, Ziming
    ENERGIES, 2023, 16 (13)
  • [7] Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence
    Capuccini, Marco
    Carlsson, Lars
    Norinder, Ulf
    Spjuth, Ola
    2015 IEEE/ACM 2ND INTERNATIONAL SYMPOSIUM ON BIG DATA COMPUTING (BDC), 2015, : 61 - 67
  • [8] Large-scale data mining using genetics-based machine learning
    Bacardit, Jaume
    Llora, Xavier
    WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2013, 3 (01) : 37 - 61
  • [9] Humanization of antibodies using a machine learning approach on large-scale repertoire data
    Marks, Claire
    Hummer, Alissa M.
    Chin, Mark
    Deane, Charlotte M.
    BIOINFORMATICS, 2021, 37 (22) : 4041 - 4047
  • [10] A machine learning model for Alzheimer's disease prediction
    Rani, Pooja
    Lamba, Rohit
    Sachdeva, Ravi Kumar
    Kumar, Karan
    Iwendi, Celestine
    IET CYBER-PHYSICAL SYSTEMS: THEORY & APPLICATIONS, 2024, 9 (02) : 125 - 134