Lifelong Machine Learning and root cause analysis for large-scale cancer patient data

Cited by: 4
Authors
Pal, Gautam [1]
Hong, Xianbin [2]
Wang, Zhuo [2]
Wu, Hongyi [2]
Li, Gangmin [2]
Atkinson, Katie [1]
Affiliations
[1] Dept Comp Sci, Liverpool, Merseyside, England
[2] Xian Jiaotong Liverpool Univ, Res Inst Big Data Analyt, Suzhou, Peoples R China
Keywords
Lifelong learning; Real-time data processing; Lambda Architecture; Streaming k-means; Random Decision Forest; Dimension reduction;
DOI
10.1186/s40537-019-0261-9
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Introduction: This paper presents a lifelong learning framework that continually adapts to changing data patterns through incremental learning. In many big data systems, iteratively retraining on high-dimensional data from scratch is computationally infeasible, because constant ingestion of streaming data on top of a historical data pool increases training time exponentially. The need therefore arises to retain past learning and to update the model quickly and incrementally as new data arrive. In addition, current machine learning approaches make predictions without providing a comprehensive root cause analysis. To address these limitations, our framework builds on an ensemble of stream data and historical batch data for an incremental lifelong machine learning (LML) model.
Case description: A cancer patient's pathological tests, such as blood, DNA, urine, or tissue analysis, provide a unique signature based on the DNA combinations. Our analysis allows personalized and targeted medication with the aim of achieving a therapeutic response. The model is evaluated on data from the National Cancer Institute's Genomic Data Commons unified data repository. The aim is to prescribe personalized medicine based on the thousands of genotype and phenotype parameters for each patient.
Discussion and evaluation: The model uses a dimension reduction method to reduce training time in an online sliding-window setting. We identify the Gleason score as a determining factor for cancer possibility and substantiate this claim with the Lilliefors and Kolmogorov-Smirnov tests. We present clustering and Random Decision Forest results. The model's prediction accuracy is compared with that of standard machine learning algorithms for numeric and categorical fields.
Conclusion: We propose an ensemble framework of stream and batch data for incremental lifelong learning. The framework first applies a streaming clustering technique and then a Random Decision Forest (RDF) regressor/classifier to isolate anomalous patient data, and it provides reasoning through root cause analysis based on feature correlations, with the aim of improving the overall survival rate. While the stream clustering technique creates groups of patient profiles, RDF drills down into each group for comparison and reasoning, yielding actionable insights. The proposed MALA architecture retains past learned knowledge, transfers it to future learning, and becomes more knowledgeable over time.
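The streaming clustering stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `StreamingKMeans`, the `decay` parameter, and the toy two-dimensional points are all assumptions made for illustration. The sketch shows one common way to update centroids incrementally from mini-batches while down-weighting older data, so that the model adapts without retraining on the full history. The Random Decision Forest stage is not shown.

```python
from typing import List, Sequence


def nearest(centroids: List[List[float]], x: Sequence[float]) -> int:
    """Return the index of the centroid closest to x (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((c - v) ** 2 for c, v in zip(centroids[j], x)),
    )


class StreamingKMeans:
    """Hypothetical mini-batch k-means with exponential forgetting."""

    def __init__(self, init_centroids: List[List[float]], decay: float = 0.5):
        self.centroids = [list(c) for c in init_centroids]
        self.weights = [0.0] * len(self.centroids)  # effective points seen
        self.decay = decay  # assumption: 1.0 keeps all history, 0.0 forgets it

    def update(self, batch: List[Sequence[float]]) -> None:
        k, dim = len(self.centroids), len(self.centroids[0])
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        # Assign each new point to its nearest current centroid.
        for x in batch:
            j = nearest(self.centroids, x)
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += x[d]
        # Discount the influence of all previously seen data.
        self.weights = [w * self.decay for w in self.weights]
        # Blend each old centroid with the mean of its newly assigned points.
        for j in range(k):
            if counts[j] == 0:
                continue
            old = self.weights[j]
            total = old + counts[j]
            self.centroids[j] = [
                (self.centroids[j][d] * old + sums[j][d]) / total
                for d in range(dim)
            ]
            self.weights[j] = total

    def predict(self, x: Sequence[float]) -> int:
        return nearest(self.centroids, x)


# Toy usage: two batches arrive one after the other.
model = StreamingKMeans([[0.0, 0.0], [10.0, 10.0]], decay=0.5)
model.update([[1.0, 1.0], [1.0, 3.0], [9.0, 9.0]])
model.update([[3.0, 2.0], [3.0, 2.0]])
print(model.centroids)
print(model.predict([8.0, 8.0]))
```

With `decay` below 1, recent batches pull the centroids more strongly than older ones, which is one simple way to approximate a sliding-window setting.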
Pages: 29
Source
Journal of Big Data, 6