Multi-Resource Interleaving for Deep Learning Training

Cited by: 18
Authors
Zhao, Yihao [1, 3]
Liu, Yuanqiang [1, 3]
Peng, Yanghua [2]
Zhu, Yibo [2]
Liu, Xuanzhe [1, 3]
Jin, Xin [1, 3]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] ByteDance Inc, Beijing, Peoples R China
[3] Peking Univ, Minist Educ, Key Lab High Confidence Software Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Resource sharing; deep learning
DOI
10.1145/3544216.3544224
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Training a deep learning (DL) model requires multiple resource types, including CPUs, GPUs, storage I/O, and network I/O. Advances in DL have produced a wide spectrum of models with diverse usage patterns across these resource types. Existing DL schedulers focus only on GPU allocation and miss the opportunity to pack jobs along multiple resource types. We present Muri, a multi-resource cluster scheduler for DL workloads. Muri exploits multi-resource interleaving of DL training jobs to achieve high resource utilization and reduce job completion time (JCT). DL jobs have a unique staged, iterative computation pattern. In contrast to multi-resource schedulers for big data workloads, which pack jobs in the space dimension, Muri leverages this pattern to interleave jobs on the same set of resources in the time dimension. Muri adapts the Blossom algorithm to find the perfect grouping plan for single-GPU jobs with two resource types, and generalizes the algorithm to handle multi-GPU jobs with more than two types. We build a prototype of Muri and integrate it with PyTorch. Experiments on a cluster with 64 GPUs demonstrate that Muri improves the average JCT by up to 3.6× and the makespan by up to 1.6× over existing DL schedulers.
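The grouping idea in the abstract can be illustrated with a small sketch. It is not Muri's implementation: the cost model (two jobs alternating between two resources, with each phase lasting as long as its slower stage) is a simplifying assumption, the job names and stage durations are made up, and an exhaustive search over pairings stands in for the Blossom algorithm, which solves the same matching problem efficiently. The names `interleaved_time`, `saving`, and `best_pairing` are hypothetical.

```python
# Illustrative sketch only: simplified cost model and brute-force matching,
# used here in place of the Blossom algorithm named in the abstract.

def interleaved_time(a, b):
    """Per-iteration time when jobs a and b alternate on two resources.

    Each job is a tuple (time_on_resource_0, time_on_resource_1).
    Assumed model: in phase 1, a uses resource 0 while b uses resource 1;
    in phase 2 they swap. Each phase lasts as long as its slower stage.
    """
    return max(a[0], b[1]) + max(a[1], b[0])


def saving(a, b):
    """Time saved per iteration by interleaving a and b instead of running them back to back."""
    return (sum(a) + sum(b)) - interleaved_time(a, b)


def best_pairing(jobs):
    """Return (total_saving, pairs) for the best perfect pairing of the jobs.

    Exhaustive search, so only suitable for a handful of jobs.
    """
    if len(jobs) % 2 != 0:
        raise ValueError("a perfect pairing needs an even number of jobs")
    names = sorted(jobs)
    if not names:
        return 0, []
    first, rest = names[0], names[1:]
    best = None
    for partner in rest:
        remaining = {n: jobs[n] for n in rest if n != partner}
        sub_score, sub_pairs = best_pairing(remaining)
        score = saving(jobs[first], jobs[partner]) + sub_score
        if best is None or score > best[0]:
            best = (score, [(first, partner)] + sub_pairs)
    return best


# Hypothetical per-iteration stage durations: (GPU time, I/O time).
jobs = {"A": (4, 1), "B": (1, 4), "C": (3, 3), "D": (2, 2)}

if __name__ == "__main__":
    print(best_pairing(jobs))  # (9, [('A', 'B'), ('C', 'D')])
```

In this toy instance the GPU-heavy job A is paired with the I/O-heavy job B, because their stages overlap almost perfectly under the assumed model.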
Pages: 428-440
Page count: 13