A container-based workflow for distributed training of deep learning algorithms in HPC clusters

Cited by: 1

Authors:

Gonzalez-Abad, Jose ^{[1]}

Lopez Garcia, Alvaro ^{[1]}

Kozlov, Valentin Y. ^{[2]}

Affiliations:

[1] Univ Cantabria, Inst Fis Cantabria IFCA, CSIC, Santander, Spain

[2] Karlsruhe Inst Technol KIT, Steinbuch Ctr Comp SCC, Karlsruhe, Germany

Source:

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2023 / Volume 26 / Issue 05

Keywords:

Distributed training; Deep learning; High performance computing; udocker; Docker; Horovod; CONFIGURATION;

DOI:

10.1007/s10586-022-03798-7

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In the academic field, researchers get access to these resources through High Performance Computing (HPC) clusters. Such infrastructure makes the training of these models difficult due to its multi-user nature and limited user permissions. In addition, different HPC clusters may have peculiarities that can complicate the research cycle (e.g., library dependencies). In this paper we develop a workflow and methodology for the distributed training of deep learning models in HPC clusters which provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not need any special permissions, allowing researchers to run the entire workflow without relying on any administrator. Horovod ensures the efficient distribution of the training independently of the deep learning framework used. Additionally, due to containerization and specific features of the workflow, it provides researchers with a cluster-agnostic way of running their models. The experiments carried out show that the workflow offers good scalability in the distributed training of the models and that it easily adapts to different clusters.

Pages: 2815 - 2834

Number of pages: 20

50 related records in total

[1] A container-based workflow for distributed training of deep learning algorithms in HPC clusters
Jose González-Abad
Álvaro López García
Valentin Y. Kozlov
[J]. Cluster Computing, 2023, 26 : 2815 - 2834
[2] DRPC: Distributed Reinforcement Learning Approach for Scalable Resource Provisioning in Container-Based Clusters
Bai, Haoyu
Xu, Minxian
Ye, Kejiang
Buyya, Rajkumar
Xu, Chengzhong
[J]. IEEE Transactions on Services Computing, 2024, 17 (06): : 3473 - 3484
[3] High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters
Zhang, Jie
Lu, Xiaoyi
Panda, Dhabaleswar K.
[J]. PROCEEDINGS 45TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING - ICPP 2016, 2016, : 268 - 277
[4] Providing Security in Container-Based HPC Runtime Environments
Gantikow, Holger
Reich, Christoph
Knahl, Martin
Clarke, Nathan
[J]. HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2016 INTERNATIONAL WORKSHOPS, 2016, 9945 : 685 - 695
[5] Container-based virtual elastic clusters
de Alfonso, Carlos
Calatrava, Amanda
Molto, German
[J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2017, 127 : 1 - 11
[6] Proposal of Container-Based HPC Structures and Performance Analysis
Yong, Chanho
Lee, Go-Won
Huh, Eui-Nam
[J]. JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2018, 14 (06): : 1398 - 1404
[7] Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads
Liu, Rui
Wong, David
Lange, Dave
Larsson, Patrik
Jethava, Vinay
Zheng, Qing
[J]. PROCEEDINGS OF THE 6TH WORKSHOP ON DATA MANAGEMENT FOR END-TO-END MACHINE LEARNING, DEEM 2022, 2022,
[8] Minimizing Communication Overheads in Container-based Clouds for HPC Applications
Maliszewski, Anderson M.
Vogel, Adriano
Griebler, Dalvan
Roloff, Eduardo
Fernandes, Luiz G.
Navaux, Philippe O. A.
[J]. 2019 IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (ISCC), 2019, : 474 - 479
[9] Autoscaling recovery actions for container-based clusters
Samir, Areeg
Pahl, Claus
[J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (23):
[10] Performance Characterization of Hypervisor- and Container-based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters
Zhang, Jie
Lu, Xiaoyi
Panda, Dhabaleswar K.
[J]. 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 1777 - 1784
