A container-based workflow for distributed training of deep learning algorithms in HPC clusters

Cited by: 1
Authors
Gonzalez-Abad, Jose [1]
Lopez Garcia, Alvaro [1]
Kozlov, Valentin Y. [2]
Affiliations
[1] Univ Cantabria, Inst Fis Cantabria IFCA, CSIC, Santander, Spain
[2] Karlsruhe Inst Technol KIT, Steinbuch Ctr Comp SCC, Karlsruhe, Germany
Keywords
Distributed training; Deep learning; High performance computing; udocker; Docker; Horovod; CONFIGURATION;
DOI
10.1007/s10586-022-03798-7
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Deep learning has been proposed as a solution to numerous problems in different branches of science. Because these models are resource-intensive, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In academia, researchers access these resources through High Performance Computing (HPC) clusters. The multi-user nature of these infrastructures and their limited user permissions make training such models difficult. In addition, HPC clusters differ in ways that can complicate the research cycle (e.g., library dependencies). In this paper we develop a workflow and methodology for the distributed training of deep learning models in HPC clusters that provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not need any special permissions, allowing researchers to run the entire workflow without relying on an administrator. Horovod ensures efficient distribution of the training independently of the deep learning framework used. Additionally, thanks to containerization and specific features of the workflow, it gives researchers a cluster-agnostic way of running their models. The experiments carried out show that the workflow scales well in the distributed training of the models and that it adapts easily to different clusters.
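As a rough illustration of what distributing training across multiple GPUs involves, the sketch below simulates the gradient-averaging step of data-parallel training in plain Python. It is not Horovod's API and is not taken from the paper: the function names `allreduce_average` and `sgd_step`, the four-worker setup, and the gradient values are all hypothetical, chosen only to show that every worker ends up applying the same averaged update.

```python
from typing import List


def allreduce_average(worker_grads: List[List[float]]) -> List[float]:
    """Average per-worker gradients element-wise (the result an allreduce-average yields)."""
    if not worker_grads:
        raise ValueError("need at least one worker")
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must report gradients of the same length")
    n_workers = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(length)]


def sgd_step(weights: List[float], grad: List[float], lr: float) -> List[float]:
    """Apply one plain SGD update with the (averaged) gradient."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Hypothetical setup: 4 workers (e.g. 4 GPUs), each holding gradients
# computed on its own shard of the training batch.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg = allreduce_average(grads)
new_weights = sgd_step([0.0, 0.0], avg, lr=0.5)
print(avg, new_weights)
```

In a real run each worker would hold only its own gradients and a communication library would perform the reduction over the network; the list of lists here stands in for that exchange.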
Pages: 2815-2834
Number of pages: 20
Related papers
50 in total
  • [1] A container-based workflow for distributed training of deep learning algorithms in HPC clusters
    Jose González-Abad
    Álvaro López García
    Valentin Y. Kozlov
    [J]. Cluster Computing, 2023, 26 : 2815 - 2834
  • [2] DRPC: Distributed Reinforcement Learning Approach for Scalable Resource Provisioning in Container-Based Clusters
    Bai, Haoyu
    Xu, Minxian
    Ye, Kejiang
    Buyya, Rajkumar
    Xu, Chengzhong
    [J]. IEEE Transactions on Services Computing, 2024, 17 (06): 3473 - 3484
  • [3] High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters
    Zhang, Jie
    Lu, Xiaoyi
    Panda, Dhabaleswar K.
    [J]. PROCEEDINGS 45TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING - ICPP 2016, 2016: 268 - 277
  • [4] Providing Security in Container-Based HPC Runtime Environments
    Gantikow, Holger
    Reich, Christoph
    Knahl, Martin
    Clarke, Nathan
    [J]. HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2016 INTERNATIONAL WORKSHOPS, 2016, 9945 : 685 - 695
  • [5] Container-based virtual elastic clusters
    de Alfonso, Carlos
    Calatrava, Amanda
    Molto, German
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2017, 127 : 1 - 11
  • [6] Proposal of Container-Based HPC Structures and Performance Analysis
    Yong, Chanho
    Lee, Go-Won
    Huh, Eui-Nam
    [J]. JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2018, 14 (06): 1398 - 1404
  • [7] Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads
    Liu, Rui
    Wong, David
    Lange, Dave
    Larsson, Patrik
    Jethava, Vinay
    Zheng, Qing
    [J]. PROCEEDINGS OF THE 6TH WORKSHOP ON DATA MANAGEMENT FOR END-TO-END MACHINE LEARNING, DEEM 2022, 2022
  • [8] Minimizing Communication Overheads in Container-based Clouds for HPC Applications
    Maliszewski, Anderson M.
    Vogel, Adriano
    Griebler, Dalvan
    Roloff, Eduardo
    Fernandes, Luiz G.
    Navaux, Philippe O. A.
    [J]. 2019 IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (ISCC), 2019: 474 - 479
  • [9] Autoscaling recovery actions for container-based clusters
    Samir, Areeg
    Pahl, Claus
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (23)
  • [10] Performance Characterization of Hypervisor- and Container-based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters
    Zhang, Jie
    Lu, Xiaoyi
    Panda, Dhabaleswar K.
    [J]. 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016: 1777 - 1784