Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Cited by: 51
Authors
Gu, Rong [1]
Chen, Yuquan [1]
Liu, Shuai [1]
Dai, Haipeng [1]
Chen, Guihai
Zhang, Kai [2]
Che, Yang [1,2]
Huang, Yihua [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Zhejiang, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
Graphics processing units; Processor scheduling; Resource management; Estimation; Liquids; Optimization; Training; Job scheduling; resource management; deep learning; GPU clusters;
DOI
10.1109/TPDS.2021.3138825
CLC number
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, and self-driving automobiles. GPUs can train DL models efficiently but are expensive, which motivates users to share GPU resources to reduce monetary costs in practice. To ensure efficient sharing among multiple users, it is necessary to develop efficient GPU resource management and scheduling solutions. However, existing ones have several shortcomings. First, they require users to specify their jobs' resource requirements, which are usually quite inaccurate and lead to cluster resource underutilization. Second, when scheduling DL jobs, they rarely take the cluster network characteristics into consideration, resulting in low job execution performance. To overcome these issues, we propose Liquid, an efficient GPU resource management platform for DL jobs with intelligent resource requirement estimation and scheduling. First, we propose a regression-model-based method for job resource requirement estimation to keep users from over-allocating computing resources. Second, building on these estimation techniques, we propose intelligent, cluster-network-efficient scheduling methods in both immediate and batch modes. Third, we further propose three system-level optimizations: pre-scheduling data transmission, fine-grained GPU sharing, and event-driven communication. Experimental results show that Liquid accelerates job execution by 18% on average and shortens the average job completion time (JCT) by 21% compared with cutting-edge solutions. Moreover, the proposed optimization methods are effective in various scenarios.
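The abstract's first contribution, regression-model-based resource requirement estimation, can be illustrated with a minimal sketch. This is not the paper's actual model: the job features (parameter count, batch size), the synthetic training log, and the `estimate_gpu_memory` helper are all hypothetical, chosen only to show the idea of fitting a regression on historical job profiles to predict a new job's GPU memory demand instead of trusting a user-specified figure.

```python
import numpy as np

# Hypothetical profiling log (all values synthetic): each row is
# (model parameter count in millions, batch size), and y is the
# observed peak GPU memory in GB for that training job.
X = np.array([
    [25.0, 32], [25.0, 64], [60.0, 32],
    [60.0, 64], [110.0, 32], [110.0, 64],
], dtype=float)
y = np.array([3.1, 4.2, 5.0, 7.1, 8.2, 11.9])

# Fit y ≈ w0 + w1*params + w2*batch by ordinary least squares.
A = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def estimate_gpu_memory(params_m: float, batch: int) -> float:
    """Predict peak GPU memory (GB) for an unseen job configuration."""
    return float(w @ np.array([1.0, params_m, batch]))

# Estimate the demand of a new job before scheduling it.
print(round(estimate_gpu_memory(80.0, 48), 2))
```

A scheduler could use such an estimate as the job's provisional GPU memory allocation, avoiding the underutilization the abstract attributes to over-stated user requests.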
Pages: 2808-2820
Page count: 13
Related papers
50 items in total
  • [41] DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters
    Zhao, Zhuoran
    Barijough, Kamyar Mirzazad
    Gerstlauer, Andreas
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2018, 37 (11) : 2348 - 2359
  • [42] IARA: An Intelligent Application-aware VNF for Network Resource Allocation with Deep Learning
    Xu, Jun
    Wang, Jingyu
    Qi, Qi
    Sun, Haifeng
    He, Bo
    2018 15TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING (SECON), 2018, : 458 - 460
  • [43] Space Information Network Resource Scheduling for Cloud Computing: A Deep Reinforcement Learning Approach
    Wang, Yufei
    Liu, Jun
    Yin, Yanhua
    Tong, Yu
    Liu, Jiansheng
    WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2022, 2022
  • [44] Efficient Distributed Energy Resource Voltage Control Using Ensemble Deep Reinforcement Learning
    Obert, James
    Trevizan, Rodrigo D.
    Chavez, Adrian
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2023, 17 (02) : 293 - 308
  • [45] Resource-Efficient Distributed Deep Neural Networks Empowered by Intelligent Software-Defined Networking
    Lu, Ke
    Du, Zhekai
    Li, Jingjing
    Min, Geyong
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2022, 19 (04): : 4069 - 4081
  • [46] Verification of intelligent scheduling based on deep reinforcement learning for distributed workshops via discrete event simulation
    Yang, S. L.
    Wang, J. Y.
    Xin, L. M.
    Xu, Z. G.
    ADVANCES IN PRODUCTION ENGINEERING & MANAGEMENT, 2022, 17 (04): : 401 - 412
  • [47] A deep reinforcement learning based hybrid algorithm for efficient resource scheduling in edge computing environment
    Xue, Fei
    Hai, Qiuru
    Dong, Tingting
    Cui, Zhihua
    Gong, Yuelu
    INFORMATION SCIENCES, 2022, 608 : 362 - 374
  • [48] Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters
    Kim, HyungJun
    Song, Chunggeon
    Lee, HwaMin
    Yu, Heonchang
    2023 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, ICCE, 2023,
  • [49] Data-driven dynamic resource scheduling for network slicing: A Deep reinforcement learning approach
    Wang, Haozhe
    Wu, Yulei
    Min, Geyong
    Xu, Jie
    Tang, Pengcheng
    INFORMATION SCIENCES, 2019, 498 : 106 - 116
  • [50] Intelligent Scheduling for Group Distributed Manufacturing Systems: Harnessing Deep Reinforcement Learning in Cloud-Edge Cooperation
    Guo, Peng
    Xiong, Jianyu
    Wang, Yi
    Meng, Xiangyin
    Qian, Linmao
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (02): : 1687 - 1698