Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Cited by: 51
Authors
Gu, Rong [1 ]
Chen, Yuquan [1 ]
Liu, Shuai [1 ]
Dai, Haipeng [1 ]
Chen, Guihai
Zhang, Kai [2 ]
Che, Yang [1 ,2 ]
Huang, Yihua [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Zhejiang, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
Graphics processing units; Processor scheduling; Resource management; Estimation; Liquids; Optimization; Training; Job scheduling; resource management; deep learning; GPU clusters;
DOI
10.1109/TPDS.2021.3138825
Chinese Library Classification (CLC) number
TP301 [Theory and Methods];
Discipline classification code
081202;
Abstract
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, self-driving automobiles, etc. GPUs can train DL models efficiently but are expensive, which motivates users to share GPU resources to reduce monetary costs in practice. To ensure efficient sharing among multiple users, it is necessary to develop effective GPU resource management and scheduling solutions. However, existing ones have several shortcomings. First, they require users to specify job resource requirements, which are usually quite inaccurate and lead to cluster resource underutilization. Second, when scheduling DL jobs, they rarely take cluster network characteristics into consideration, resulting in low job execution performance. To overcome these issues, we propose Liquid, an efficient GPU resource management platform for DL jobs with intelligent resource requirement estimation and scheduling. First, we propose a regression-model-based method for job resource requirement estimation to prevent users from over-allocating computing resources. Second, we propose intelligent, network-efficient cluster scheduling methods in both immediate and batch modes, built on the above resource requirement estimation techniques. Third, we further propose three system-level optimizations, including pre-scheduling data transmission, fine-grained GPU sharing, and event-driven communication. Experimental results show that Liquid accelerates job execution speed by 18% on average and shortens the average job completion time (JCT) by 21% compared with cutting-edge solutions. Moreover, the proposed optimization methods are effective in various scenarios.
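As a rough illustration of the regression-based resource estimation idea mentioned in the abstract, the short Python sketch below fits a least-squares model from hypothetical historical job profiles (batch size, model parameters, input size) to observed peak GPU memory, then pads the prediction with a safety headroom so the scheduler does not under-provision a job. The feature set, training data, and the estimate_gpu_memory helper are illustrative assumptions, not Liquid's actual model or interface.

# Illustrative sketch only: a least-squares regression mapping coarse job
# features to an estimated GPU-memory requirement, in the spirit of the
# regression-model-based estimation described in the abstract. All features,
# numbers, and names here are hypothetical.
import numpy as np

# Hypothetical historical jobs: [batch_size, model_params (millions), input_size (MB)]
X = np.array([
    [32,  25.0, 150.0],
    [64,  25.0, 150.0],
    [32,  60.0, 300.0],
    [128, 11.0,  90.0],
    [64, 138.0, 500.0],
], dtype=float)
# Observed peak GPU memory (GB) for those jobs.
y = np.array([3.1, 4.9, 6.2, 4.0, 14.5])

# Fit a linear model y ~ X_aug @ w with a bias term via least squares.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

def estimate_gpu_memory(batch_size, params_millions, input_mb, headroom=1.2):
    """Predict peak GPU memory (GB), padded with a safety headroom."""
    feats = np.array([batch_size, params_millions, input_mb, 1.0])
    return float(feats @ w) * headroom

# Example: estimate the requirement of a new, unseen job.
print(estimate_gpu_memory(64, 60.0, 300.0))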
Pages: 2808-2820
Number of pages: 13
Related Papers
50 records in total
  • [11] A fine-grained GPU sharing and job scheduling for deep learning jobs on the cloud. Chung, Wu-Chun; Tong, Jyun-Sen; Chen, Zhi-Hao. Journal of Supercomputing, 2025, 81(02).
  • [12] Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Cao, Jiamin; Guan, Yu; Qian, Kun; Gao, Jiaqi; Xiao, Wencong; Dong, Jianbo; Fu, Binzhang; Cai, Dennis; Zhai, Ennan. Proceedings of the ACM SIGCOMM 2024 Conference, 2024: 1-15.
  • [13] Benchmarking Resource Usage for Efficient Distributed Deep Learning. Frey, Nathan C.; Li, Baolin; McDonald, Joseph; Zhao, Dan; Jones, Michael; Bestor, David; Tiwari, Devesh; Gadepally, Vijay; Samsi, Siddharth. 2022 IEEE High Performance Extreme Computing Virtual Conference (HPEC), 2022.
  • [14] Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. Peng, Yanghua; Bao, Yixin; Chen, Yangrui; Wu, Chuan; Guo, Chuanxiong. EuroSys '18: Proceedings of the Thirteenth EuroSys Conference, 2018.
  • [15] Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters. Chen, Zhaoyun; Quan, Wei; Wen, Mei; Fang, Jianbin; Yu, Jie; Zhang, Chunyuan; Luo, Lei. IEEE Transactions on Parallel and Distributed Systems, 2020, 31(01): 34-50.
  • [16] A GPU Scheduling Framework to Accelerate Hyper-Parameter Optimization in Deep Learning Clusters. Son, Jaewon; Yoo, Yonghyuk; Kim, Khu-rai; Kim, Youngjae; Lee, Kwonyong; Park, Sungyong. Electronics, 2021, 10(03): 1-15.
  • [17] Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters. Bian, Zhengda; Li, Shenggui; Wang, Wei; You, Yang. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
  • [18] TensorExpress: In-Network Communication Scheduling for Distributed Deep Learning. Kang, Minkoo; Yang, Gyeongsik; Yoo, Yeonho; Yoo, Chuck. 2020 IEEE 13th International Conference on Cloud Computing (CLOUD 2020), 2020: 25-27.
  • [19] On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention. Yu, Menglu; Ji, Bo; Rajan, Hridesh; Liu, Jia. Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc 2022), 2022: 21-30.
  • [20] The Algorithms of Distributed Learning and Distributed Estimation about Intelligent Wireless Sensor Network. Tan, Fuxiao. Sensors, 2020, 20(05).