Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Cited by: 51
Authors
Gu, Rong [1 ]
Chen, Yuquan [1 ]
Liu, Shuai [1 ]
Dai, Haipeng [1 ]
Chen, Guihai
Zhang, Kai [2 ]
Che, Yang [1 ,2 ]
Huang, Yihua [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Zhejiang, Peoples R China
Funding
U.S. National Science Foundation
Keywords
Graphics processing units; Processor scheduling; Resource management; Estimation; Liquids; Optimization; Training; Job scheduling; resource management; deep learning; GPU clusters;
DOI
10.1109/TPDS.2021.3138825
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, self-driving automobiles, etc. GPUs can train DL models efficiently but are expensive, which motivates users to share GPU resources to reduce monetary costs in practice. To ensure efficient sharing among multiple users, it is necessary to develop efficient GPU resource management and scheduling solutions. However, existing solutions have several shortcomings. First, they require users to specify the job resource requirement, which is usually quite inaccurate and leads to cluster resource underutilization. Second, when scheduling DL jobs, they rarely take the cluster network characteristics into consideration, resulting in low job execution performance. To overcome the above issues, we propose Liquid, an efficient GPU resource management platform for DL jobs with intelligent resource requirement estimation and scheduling. First, we propose a regression-model-based method for job resource requirement estimation to keep users from over-allocating computing resources. Second, we propose intelligent, cluster-network-efficient scheduling methods in both immediate and batch modes built on the above resource requirement estimation techniques. Third, we further propose three system-level optimizations, including pre-scheduling data transmission, fine-grained GPU sharing, and event-driven communication. Experimental results show that Liquid can accelerate job execution speed by 18% on average and shorten the average job completion time (JCT) by 21% compared with cutting-edge solutions. Moreover, the proposed optimization methods are effective in various scenarios.
Pages: 2808-2820
Page count: 13
Related Papers
50 records
  • [31] Intelligent Deep Reinforcement Learning based Resource Allocation in Fog network
    Divya, V
    Sri, Leena R.
    2019 26TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA AND ANALYTICS WORKSHOP (HIPCW 2019), 2019, : 18 - 22
  • [32] Communication-Efficient Distributed Deep Learning with GPU-FPGA Heterogeneous Computing
    Tanaka, Kenji
    Arikawa, Yuki
    Ito, Tsuyoshi
    Morita, Kazutaka
    Nemoto, Naru
    Miura, Fumiaki
    Terada, Kazuhiko
    Teramoto, Junji
    Sakamoto, Takeshi
    2020 IEEE SYMPOSIUM ON HIGH-PERFORMANCE INTERCONNECTS (HOTI 2020), 2020, : 43 - 46
  • [33] Low Latency Deep Learning Inference Model for Distributed Intelligent IoT Edge Clusters
    Naveen, Soumyalatha
    Kounte, Manjunath R.
    Ahmed, Mohammed Riyaz
    IEEE ACCESS, 2021, 9 : 160607 - 160621
  • [34] TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters
    Lee, Seil
    Kim, Hanjoo
    Park, Jaehong
    Jang, Jaehee
    Jeong, Chang-Sung
    Yoon, Sungroh
    IEEE ACCESS, 2018, 6 : 27671 - 27680
  • [35] Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters
    Truong Thao Nguyen
    Wahib, Mohamed
    Takano, Ryousei
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (12)
  • [36] Efficient NPU–GPU scheduling for real-time deep learning inference on mobile devices
    Chengwu Yu
    Meng Wang
    Shan Chen
    Wanqi Wang
    Weiwei Fang
    Yanming Chen
    Neal N. Xiong
    Journal of Real-Time Image Processing, 2025, 22 (2)
  • [37] An intelligent and efficient network intrusion detection system using deep learning
    Qazi, Emad-ul-Haq
    Imran, Muhammad
    Haider, Noman
    Shoaib, Muhammad
    Razzak, Imran
    COMPUTERS & ELECTRICAL ENGINEERING, 2022, 99
  • [38] An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms
    Lee, Sangkwon
    Shah, Syed Asif Raza
    Seok, Woojin
    Moon, Jeonghoon
    Kim, Kihyeon
    Shah, Syed Hasnain Raza
    ELECTRONICS, 2023, 12 (14)
  • [39] NAIR: An Efficient Distributed Deep Learning Architecture for Resource Constrained IoT System
    Xiao, Yucong
    Zhang, Daobing
    Wang, Yunsheng
    Dai, Xuewu
    Huang, Zhipei
    Zhang, Wuxiong
    Yang, Yang
    Anjum, Ashiq
    Qin, Fei
    IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (12): 21427 - 21439
  • [40] Intelligent real-time scheduling of water supply network based on deep learning
    Pu, Zhengheng
    Chen, Minghai
    Ji, Xuanting
    Fu, Yanfu
    Tian, Wenchong
    Chen, Lei
    Tao, Tao
    Xin, Kunlun
    AQUA-WATER INFRASTRUCTURE ECOSYSTEMS AND SOCIETY, 2023, 72 (12) : 2277 - 2292