Multi-Resource Interleaving for Deep Learning Training

Cited by: 18

Authors:

Zhao, Yihao ^{[1,3]}

Liu, Yuanqiang ^{[1,3]}

Peng, Yanghua ^{[2]}

Zhu, Yibo ^{[2]}

Liu, Xuanzhe ^{[1,3]}

Jin, Xin ^{[1,3]}

Affiliations:

[1] Peking Univ, Beijing, Peoples R China

[2] ByteDance Inc, Beijing, Peoples R China

[3] Peking Univ, Minist Educ, Key Lab High Confidence Software Technol, Beijing, Peoples R China

Source:

SIGCOMM '22: PROCEEDINGS OF THE 2022 ACM SIGCOMM 2022 CONFERENCE | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

Resource sharing; deep learning;

DOI:

10.1145/3544216.3544224

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Training a deep learning (DL) model requires multiple resource types, including CPUs, GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of models with diverse usage patterns across these resource types. Existing DL schedulers focus only on GPU allocation and miss the opportunity to pack jobs along multiple resource types. We present Muri, a multi-resource cluster scheduler for DL workloads. Muri exploits multi-resource interleaving of DL training jobs to achieve high resource utilization and reduce job completion time (JCT). DL jobs have a unique staged, iterative computation pattern. In contrast to multi-resource schedulers for big data workloads, which pack jobs in the space dimension, Muri leverages this pattern to interleave jobs on the same set of resources in the time dimension. Muri adapts the Blossom algorithm to find the perfect grouping plan for single-GPU jobs with two resource types, and generalizes the algorithm to handle multi-GPU jobs with more than two resource types. We build a prototype of Muri and integrate it with PyTorch. Experiments on a cluster with 64 GPUs demonstrate that Muri improves the average JCT by up to 3.6x and the makespan by up to 1.6x over existing DL schedulers.
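
The grouping idea in the abstract can be illustrated with a toy sketch. It is not Muri's implementation: the job names, stage durations, and the two-stage cost model below are all hypothetical, and an exhaustive search over pairings stands in for the Blossom algorithm, which the abstract says Muri adapts. The sketch assumes each single-GPU job alternates between a CPU stage and a GPU stage, and that two interleaved jobs swap resources in lockstep.

```python
# Hypothetical per-iteration stage durations in seconds: (cpu_stage, gpu_stage).
jobs = {
    "A": (3.0, 1.0),
    "B": (1.0, 3.0),
    "C": (2.0, 2.0),
    "D": (1.0, 2.0),
}


def interleaved_period(a, b):
    """Assumed model: job a is on the CPU while job b is on the GPU, then they swap."""
    return max(a[0], b[1]) + max(a[1], b[0])


def saving(a, b):
    """Time saved per round by interleaving a and b instead of running them back to back."""
    return (sum(a) + sum(b)) - interleaved_period(a, b)


def best_pairing(names):
    """Exhaustively search perfect pairings (a brute-force stand-in for Blossom)."""
    names = tuple(sorted(names))
    if len(names) % 2:
        raise ValueError("need an even number of jobs for a perfect pairing")
    if not names:
        return 0.0, []
    first, rest = names[0], names[1:]
    best_total, best_pairs = -1.0, []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        sub_total, sub_pairs = best_pairing(remaining)
        total = saving(jobs[first], jobs[partner]) + sub_total
        if total > best_total:
            best_total, best_pairs = total, [(first, partner)] + sub_pairs
    return best_total, best_pairs


total, pairs = best_pairing(jobs)
print(pairs, total)
```

Under this assumed model, jobs with complementary stage profiles (CPU-heavy with GPU-heavy) save the most when grouped. Exhaustive search scales factorially with the number of jobs, which is why a polynomial-time matching algorithm matters at cluster scale.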


Pages: 428-440

Page count: 13
