Reproducibility of Training Deep Learning Models for Medical Image Analysis

Authors
Bosma, Joeran Sander [1]
Peeters, Dré [1]
Alves, Natália [1]
Saha, Anindo [1]
Saghir, Zaigham [2]
Jacobs, Colin [1]
Huisman, Henkjan [1]
Affiliations
[1] Radboud Univ Nijmegen, Ctr Med, Diagnost Image Anal Grp, Dept Med Imaging, NL-6525 GA Nijmegen, Netherlands
[2] Herlev Gentofte Hosp, Sect Pulm Med, Dept Med, Hellerup, Denmark
Keywords
Deep learning; reproducibility; medical image analysis; performance variance
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The performance of deep learning algorithms depends on their development data and training method, but also on several stochastic processes during training. Because of these random factors, a single training run may not accurately reflect the performance of a given training method. Statistical comparisons in the literature between different deep learning training methods typically ignore this performance variation between training runs and incorrectly claim that changes in the training method are significant. We hypothesize that the impact of such performance variation is substantial, such that it may invalidate biomedical competition leaderboards and some scientific papers. To test this, we investigate the reproducibility of training deep learning algorithms for medical image analysis. We repeated training runs from prior scientific studies: three diagnostic tasks (pancreatic cancer detection in CT, clinically significant prostate cancer detection in MRI, and lung nodule malignancy risk estimation in low-dose CT) and two organ segmentation tasks (pancreas segmentation in CT and prostate segmentation in MRI). A previously published top-performing algorithm for each task was trained multiple times to determine the variance in model performance. For all three diagnostic algorithms, performance variation from retraining was significant compared to data variance. A statistical comparison of independently trained algorithms that share the same training method and the same dataset should follow the null hypothesis, yet conventional testing (paired bootstrapping) claimed significance with a p-value below 0.05 in 15% of comparisons. We conclude that variance in model performance due to retraining is substantial and should be accounted for.
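The comparison described in the abstract can be illustrated with a minimal sketch. It assumes AUROC as the metric and a percentile-style two-sided p-value, which is one common variant of paired bootstrapping; the record does not state the exact procedure the authors used. The function names (auroc, paired_bootstrap_pvalue, fraction_significant) and all parameters are hypothetical and chosen for illustration only.

```python
from itertools import combinations

import numpy as np


def auroc(labels, scores):
    """AUROC via pairwise comparison of positives and negatives (ties count half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def paired_bootstrap_pvalue(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Two-sided p-value for a difference in AUROC between two models.

    Cases are resampled with replacement, and the SAME resampled cases are
    used for both models, which is what makes the bootstrap paired.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].all() or not labels[idx].any():
            continue  # AUROC is undefined without both classes; redraw
        deltas.append(auroc(labels[idx], scores_a[idx])
                      - auroc(labels[idx], scores_b[idx]))
    deltas = np.asarray(deltas)
    # Twice the smaller fraction of resamples on either side of zero.
    p = 2 * min((deltas <= 0).mean(), (deltas >= 0).mean())
    return min(1.0, float(p))


def fraction_significant(labels, runs, alpha=0.05, n_boot=2000):
    """Fraction of pairwise comparisons between retrained models with p < alpha.

    `runs` holds one array of test-set predictions per training run of the
    same method. Under the null hypothesis this fraction should stay near alpha.
    """
    pvalues = [paired_bootstrap_pvalue(labels, a, b, n_boot=n_boot)
               for a, b in combinations(runs, 2)]
    return float(np.mean([p < alpha for p in pvalues]))
```

Because the test only resamples cases, it captures data variance but not the variance between training runs. Passing the predictions of several retrained models to fraction_significant is one way to check how often such a test flags a difference where the training method is identical.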
Pages: 1269-1287
Page count: 19
Related papers
50 in total
  • [1] Reproducibility of Training Deep Learning Models for Medical Image Analysis
    Bosma, Joeran Sander
    Peeters, Dré
    Alves, Natália
    Saha, Anindo
    Saghir, Zaigham
    Jacobs, Colin
    Huisman, Henkjan
    Proceedings of Machine Learning Research, 2023, 227 : 1269 - 1287
  • [2] Carbon Footprint of Selecting and Training Deep Learning Models for Medical Image Analysis
    Selvan, Raghavendra
    Bhagwat, Nikhil
    Anthony, Lasse F. Wolff
    Kanding, Benjamin
    Dam, Erik B.
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT V, 2022, 13435 : 506 - 516
  • [3] Deep learning models in medical image analysis
    Tsuneki, Masayuki
    JOURNAL OF ORAL BIOSCIENCES, 2022, 64 (03) : 312 - 320
  • [4] A Survey of Deep Learning Models for Medical Image Analysis
    Umer, Mohammad
    Sharma, Shilpa
    Rattan, Punam
    2021 INTERNATIONAL CONFERENCE ON COMPUTING SCIENCES (ICCS 2021), 2021, : 65 - 69
  • [5] Explainable Deep Learning Models in Medical Image Analysis
    Singh, Amitojdeep
    Sengupta, Sourya
    Lakshminarayanan, Vasudevan
    JOURNAL OF IMAGING, 2020, 6 (06)
  • [6] Variability and reproducibility in deep learning for medical image segmentation
    Renard, Felix
    Guedria, Soulaimane
    De Palma, Noel
    Vuillerme, Nicolas
    SCIENTIFIC REPORTS, 2020, 10 (01)
  • [7] Variability and reproducibility in deep learning for medical image segmentation
    Félix Renard
    Soulaimane Guedria
    Noel De Palma
    Nicolas Vuillerme
    Scientific Reports, 10
  • [8] Training calibration-based counterfactual explainers for deep learning models in medical image analysis
    Thiagarajan, Jayaraman J.
    Thopalli, Kowshik
    Rajan, Deepta
    Turaga, Pavan
    SCIENTIFIC REPORTS, 2022, 12 (01)
  • [9] Training calibration-based counterfactual explainers for deep learning models in medical image analysis
    Jayaraman J. Thiagarajan
    Kowshik Thopalli
    Deepta Rajan
    Pavan Turaga
    Scientific Reports, 12
  • [10] Deep Learning Models for Medical Image Analysis: Challenges and Future Directions
    Agrawal, R. K.
    Juneja, Akanksha
    BIG DATA ANALYTICS (BDA 2019), 2019, 11932 : 20 - 32