Multi-Resource Interleaving for Deep Learning Training

Cited by: 18
Authors
Zhao, Yihao [1, 3]
Liu, Yuanqiang [1, 3]
Peng, Yanghua [2]
Zhu, Yibo [2]
Liu, Xuanzhe [1, 3]
Jin, Xin [1, 3]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] ByteDance Inc, Beijing, Peoples R China
[3] Peking Univ, Minist Educ, Key Lab High Confidence Software Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Resource sharing; deep learning
DOI
10.1145/3544216.3544224
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Training a deep learning (DL) model requires multiple resource types, including CPUs, GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of models with diverse usage patterns across these resource types. Existing DL schedulers focus only on GPU allocation and miss the opportunity to pack jobs along multiple resource types. We present Muri, a multi-resource cluster scheduler for DL workloads. Muri exploits multi-resource interleaving of DL training jobs to achieve high resource utilization and reduce job completion time (JCT). DL jobs have a unique staged, iterative computation pattern. In contrast to multi-resource schedulers for big data workloads, which pack jobs in the space dimension, Muri leverages this pattern to interleave jobs on the same set of resources in the time dimension. Muri adapts the Blossom algorithm to find the perfect grouping plan for single-GPU jobs with two resource types, and generalizes the algorithm to handle multi-GPU jobs with more than two resource types. We build a prototype of Muri and integrate it with PyTorch. Experiments on a cluster with 64 GPUs demonstrate that Muri improves the average JCT by up to 3.6x and the makespan by up to 1.6x over existing DL schedulers.
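The grouping idea in the abstract can be illustrated with a minimal sketch. This is not Muri's implementation: the two-stage cost model, the function names (`interleaved_time`, `saving`, `best_pairing`), and the job durations are all assumptions made for illustration, and an exhaustive search over pairings stands in for the Blossom algorithm, which solves the same matching problem efficiently on larger inputs.

```python
# Hypothetical model: each job is (time on resource 0, time on resource 1)
# per training iteration, e.g. (storage IO, GPU). All numbers are made up.

def interleaved_time(a, b):
    """Time for jobs a and b to each finish one iteration when interleaved.

    Slot 1: a uses resource 0 while b uses resource 1.
    Slot 2: a uses resource 1 while b uses resource 0.
    Each slot lasts as long as its slower stage.
    """
    return max(a[0], b[1]) + max(a[1], b[0])


def saving(a, b):
    """Time saved per iteration by interleaving instead of running back to back."""
    return (sum(a) + sum(b)) - interleaved_time(a, b)


def best_pairing(jobs):
    """Exhaustively find the perfect pairing with the largest total saving.

    Stand-in for a maximum-weight perfect matching (the problem Blossom
    solves); only practical for a handful of jobs.
    """
    names = sorted(jobs)
    if len(names) % 2:
        raise ValueError("need an even number of jobs for a perfect pairing")
    if not names:
        return 0.0, []
    first, rest = names[0], names[1:]
    best = (-1.0, [])
    for partner in rest:
        remaining = {n: jobs[n] for n in rest if n != partner}
        sub_saving, sub_pairs = best_pairing(remaining)
        total = saving(jobs[first], jobs[partner]) + sub_saving
        if total > best[0]:
            best = (total, [(first, partner)] + sub_pairs)
    return best


jobs = {"A": (1, 3), "B": (3, 1), "C": (2, 2), "D": (2, 2)}
total_saving, pairs = best_pairing(jobs)
print(total_saving, pairs)
```

Under this toy model, jobs with complementary bottlenecks (A is heavy on resource 1, B on resource 0) are paired together, which is the intuition behind interleaving in the time dimension rather than packing in the space dimension.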
Pages: 428-440
Page count: 13