Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Cited by: 51
Authors
Gu, Rong [1]
Chen, Yuquan [1]
Liu, Shuai [1]
Dai, Haipeng [1]
Chen, Guihai
Zhang, Kai [2]
Che, Yang [1,2]
Huang, Yihua [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Zhejiang, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
Graphics processing units; Processor scheduling; Resource management; Estimation; Liquids; Optimization; Training; Job scheduling; resource management; deep learning; GPU clusters;
DOI
10.1109/TPDS.2021.3138825
CLC number
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, and self-driving automobiles. GPUs can train DL models efficiently but are expensive, which motivates users to share GPU resources to reduce monetary costs in practice. To ensure efficient sharing among multiple users, it is necessary to develop efficient GPU resource management and scheduling solutions. However, existing ones have several shortcomings. First, they require users to specify their jobs' resource requirements, which are usually quite inaccurate and lead to cluster resource underutilization. Second, when scheduling DL jobs, they rarely take the cluster network characteristics into consideration, resulting in low job execution performance. To overcome these issues, we propose Liquid, an efficient GPU resource management platform for DL jobs with intelligent resource requirement estimation and scheduling. First, we propose a regression-model-based method for job resource requirement estimation to keep users from over-allocating computing resources. Second, building on these estimation techniques, we propose intelligent, cluster-network-efficient scheduling methods in both immediate and batch modes. Third, we further propose three system-level optimizations: pre-scheduling data transmission, fine-grained GPU sharing, and event-driven communication. Experimental results show that Liquid accelerates job execution by 18% on average and shortens the average job completion time (JCT) by 21% compared with cutting-edge solutions. Moreover, the proposed optimization methods are effective in various scenarios.
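The abstract's first contribution, regression-model-based resource requirement estimation, can be illustrated with a minimal sketch. This is not the paper's actual model: the job features (parameter count, batch size), the synthetic training log, and the `estimate_gpu_memory` helper are all hypothetical, chosen only to show the idea of fitting a regression on historical job profiles to predict a new job's GPU memory demand instead of trusting a user-specified figure.

```python
import numpy as np

# Hypothetical profiling log (all values synthetic): each row is
# (model parameter count in millions, batch size), and y is the
# observed peak GPU memory in GB for that training job.
X = np.array([
    [25.0, 32], [25.0, 64], [60.0, 32],
    [60.0, 64], [110.0, 32], [110.0, 64],
], dtype=float)
y = np.array([3.1, 4.2, 5.0, 7.1, 8.2, 11.9])

# Fit y ≈ w0 + w1*params + w2*batch by ordinary least squares.
A = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def estimate_gpu_memory(params_m: float, batch: int) -> float:
    """Predict peak GPU memory (GB) for an unseen job configuration."""
    return float(w @ np.array([1.0, params_m, batch]))

# Estimate the demand of a new job before scheduling it.
print(round(estimate_gpu_memory(80.0, 48), 2))
```

A scheduler could use such an estimate as the job's provisional GPU memory allocation, avoiding the underutilization the abstract attributes to over-stated user requests.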
Pages: 2808-2820
Page count: 13
Related papers
50 items in total
  • [41] DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters
    Zhao, Zhuoran
    Barijough, Kamyar Mirzazad
    Gerstlauer, Andreas
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2018, 37 (11) : 2348 - 2359
  • [42] IARA: An Intelligent Application-aware VNF for Network Resource Allocation with Deep Learning
    Xu, Jun
    Wang, Jingyu
    Qi, Qi
    Sun, Haifeng
    He, Bo
    2018 15TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING (SECON), 2018, : 458 - 460
  • [43] Space Information Network Resource Scheduling for Cloud Computing: A Deep Reinforcement Learning Approach
    Wang, Yufei
    Liu, Jun
    Yin, Yanhua
    Tong, Yu
    Liu, Jiansheng
    WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2022, 2022
  • [44] Efficient Distributed Energy Resource Voltage Control Using Ensemble Deep Reinforcement Learning
    Obert, James
    Trevizan, Rodrigo D.
    Chavez, Adrian
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2023, 17 (02) : 293 - 308
  • [45] Resource-Efficient Distributed Deep Neural Networks Empowered by Intelligent Software-Defined Networking
    Lu, Ke
    Du, Zhekai
    Li, Jingjing
    Min, Geyong
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2022, 19 (04): : 4069 - 4081
  • [46] Verification of intelligent scheduling based on deep reinforcement learning for distributed workshops via discrete event simulation
    Yang, S. L.
    Wang, J. Y.
    Xin, L. M.
    Xu, Z. G.
    ADVANCES IN PRODUCTION ENGINEERING & MANAGEMENT, 2022, 17 (04): : 401 - 412
  • [47] A deep reinforcement learning based hybrid algorithm for efficient resource scheduling in edge computing environment
    Xue, Fei
    Hai, Qiuru
    Dong, Tingting
    Cui, Zhihua
    Gong, Yuelu
    INFORMATION SCIENCES, 2022, 608 : 362 - 374
  • [48] Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters
    Kim, HyungJun
    Song, Chunggeon
    Lee, HwaMin
    Yu, Heonchang
    2023 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, ICCE, 2023,
  • [49] Data-driven dynamic resource scheduling for network slicing: A Deep reinforcement learning approach
    Wang, Haozhe
    Wu, Yulei
    Min, Geyong
    Xu, Jie
    Tang, Pengcheng
    INFORMATION SCIENCES, 2019, 498 : 106 - 116
  • [50] Intelligent Scheduling for Group Distributed Manufacturing Systems: Harnessing Deep Reinforcement Learning in Cloud-Edge Cooperation
    Guo, Peng
    Xiong, Jianyu
    Wang, Yi
    Meng, Xiangyin
    Qian, Linmao
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (02): : 1687 - 1698