Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Cited: 51
Authors
Gu, Rong [1 ]
Chen, Yuquan [1 ]
Liu, Shuai [1 ]
Dai, Haipeng [1 ]
Chen, Guihai
Zhang, Kai [2 ]
Che, Yang [1 ,2 ]
Huang, Yihua [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Zhejiang, Peoples R China
Funding
National Science Foundation (USA);
Keywords
Graphics processing units; Processor scheduling; Resource management; Estimation; Liquids; Optimization; Training; Job scheduling; resource management; deep learning; GPU clusters;
DOI
10.1109/TPDS.2021.3138825
CLC Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, self-driving automobiles, etc. GPUs can train DL models efficiently but are expensive, which motivates users to share GPU resources to reduce costs in practice. To ensure efficient sharing among multiple users, it is necessary to develop efficient GPU resource management and scheduling solutions. However, existing ones have several shortcomings. First, they require users to specify their jobs' resource requirements, which are usually quite inaccurate and lead to cluster resource underutilization. Second, when scheduling DL jobs, they rarely take the cluster network characteristics into consideration, resulting in low job execution performance. To overcome the above issues, we propose Liquid, an efficient GPU resource management platform for DL jobs with intelligent resource requirement estimation and scheduling. First, we propose a regression-model-based method for job resource requirement estimation, which keeps users from over-allocating computing resources. Second, building on this estimation technique, we propose intelligent, cluster-network-efficient scheduling methods in both immediate and batch modes. Third, we further propose three system-level optimizations: pre-scheduling data transmission, fine-grained GPU sharing, and event-driven communication. Experimental results show that Liquid accelerates job execution by 18% on average and shortens the average job completion time (JCT) by 21% compared with cutting-edge solutions. Moreover, the proposed optimization methods are effective in various scenarios.
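As a rough illustration of the resource-estimation idea described in the abstract, the sketch below fits a regressor on features of previously observed jobs and predicts a newly submitted job's peak GPU memory, so the scheduler need not rely on user-supplied figures. The feature set, model choice, and all numbers here are illustrative assumptions, not Liquid's actual design.

```python
# Minimal sketch of regression-based job resource estimation,
# assuming hypothetical job features and synthetic training data
# (NOT the feature set or model used by Liquid itself).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical job features: (model parameters in millions,
# batch size, input sample size in KB).
X_train = np.array([
    [ 25.0,  32, 150.0],
    [ 60.0,  64, 150.0],
    [138.0, 128, 600.0],
    [345.0, 256, 600.0],
])
# Observed peak GPU memory (GB) for those jobs -- synthetic numbers.
y_train = np.array([3.1, 6.8, 14.5, 29.0])

# One regressor per resource dimension; GPU memory only, for brevity.
model = GradientBoostingRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Estimate the requirement of a new job before scheduling it, instead
# of trusting a user-declared (and typically inflated) request.
new_job = np.array([[90.0, 64, 300.0]])
print(f"estimated peak GPU memory: {model.predict(new_job)[0]:.1f} GB")
```

In the paper's setting, an estimator of this kind would feed the network-efficient scheduler, which then places a job's workers so as to reduce cross-node traffic.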
Pages: 2808-2820
Page count: 13
Related Papers
50 records in total
  • [1] DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment
    Qiao, Wei
    Li, Ying
    Wu, Zhong-Hai
    4TH ANNUAL INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND APPLICATIONS (ITA 2017), 2017, 12
  • [2] Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
    Luo, Yizhou
    Wang, Qiang
    Shi, Shaohuai
    Lai, Jiaxin
    Qi, Shuhan
    Zhang, Jiajia
    Wang, Xuan
2024 IEEE/ACM 32ND INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE, IWQOS, 2024
  • [3] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
    Zhang, Hao
    Zheng, Zeyu
    Xu, Shizhen
    Dai, Wei
Ho, Qirong
    Liang, Xiaodan
    Hu, Zhiting
Wei, Jinliang
    Xie, Pengtao
    Xing, Eric P.
    2017 USENIX ANNUAL TECHNICAL CONFERENCE (USENIX ATC '17), 2017, : 181 - 193
  • [4] Scheduling CPU for GPU-based Deep Learning Jobs
    Xiao, Wencong
    Han, Zhenhua
    Zhao, Hanyu
    Peng, Xuan
    Zhang, Quanlu
    Yang, Fan
    Zhou, Lidong
    PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18), 2018, : 503 - 503
  • [5] PickyMan: A Preemptive Scheduler for Deep Learning Jobs on GPU Clusters
    Chen, Chen
    Chen, Yingwen
    Chen, Zhaoyun
    Han, Jianchen
    Xue, Guangtao
2022 IEEE INTERNATIONAL PERFORMANCE, COMPUTING, AND COMMUNICATIONS CONFERENCE, IPCCC, 2022
  • [6] Cooperative Distributed GPU Power Capping for Deep Learning Clusters
    Kang, Dong-Ki
    Ha, Yun-Gi
    Peng, Limei
    Youn, Chan-Hyun
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2022, 69 (07) : 7244 - 7254
  • [7] Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters
    Chen, Zhaoyun
    Luo, Lei
    Quan, Wei
    Wen, Mei
    Zhang, Chunyuan
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS WORKSHOPS (IEEE INFOCOM 2019 WKSHPS), 2019, : 1023 - 1024
  • [8] GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters
    Wang, Sheng
    Chen, Shiping
    Shi, Yumei
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 152 : 127 - 137
  • [9] Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters
    Hsieh, Tsung-Tso
    Lee, Che-Rung
    2023 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E, 2023, : 131 - 140
  • [10] Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness
    Li, Qingping
    Xu, Jingwei
    Cao, Chun
    THE 12TH ASIA-PACIFIC SYMPOSIUM ON INTERNETWARE, INTERNETWARE 2020, 2021, : 217 - 228