An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

Cited by: 43
Authors
Wang, Shaoqi [1 ]
Gonzalez, Oscar J. [2 ]
Zhou, Xiaobo [1 ]
Williams, Thomas [2 ]
Friedman, Brian D. [2 ]
Havemann, Martin [2 ]
Woo, Thomas [2 ]
Affiliations
[1] Univ Colorado, Dept Comp Sci, Colorado Springs, CO 80907 USA
[2] Nokia Bell Labs, New Providence, NJ USA
Keywords
deep learning; GPU clusters; resource scheduling; container; Kubernetes
DOI
10.1109/SC41405.2020.00094
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Efficient GPU scheduling is key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs, or requires modification of the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a "SideCar" process that temporarily stops and restarts the job's DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
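The core idea of a progress-aware adaptive scheduler, as described in the abstract, can be illustrated with a minimal greedy sketch: repeatedly hand the next free GPU to the job whose measured training progress rate would improve the most. This is an illustrative toy, not the paper's actual algorithm; the `reallocate` function, the per-GPU-count rate tables, and the one-GPU-minimum policy are all assumptions made for the example.

```python
def reallocate(jobs, total_gpus):
    """Greedy GPU reallocation sketch (hypothetical, not the paper's algorithm).

    jobs: dict mapping job name -> {gpu_count: measured progress rate},
          i.e., the scaling profile the scheduler has observed so far.
    Returns a dict mapping job name -> allocated GPU count.
    """
    # Every running job keeps at least one GPU so it never has to terminate.
    alloc = {name: 1 for name in jobs}
    remaining = total_gpus - len(jobs)

    for _ in range(remaining):
        # Marginal gain of giving one more GPU to this job, based on its
        # observed progress rates; zero if no data beyond current count.
        def gain(name):
            rates, g = jobs[name], alloc[name]
            return rates.get(g + 1, rates[g]) - rates[g]

        # Assign the next GPU to the job with the best marginal gain.
        best = max(jobs, key=gain)
        alloc[best] += 1
    return alloc


# Job "a" scales well with extra GPUs; job "b" plateaus quickly,
# so the greedy pass concentrates GPUs on "a".
profiles = {
    "a": {1: 1.0, 2: 1.9, 3: 2.7, 4: 3.4},
    "b": {1: 1.0, 2: 1.5, 3: 1.8, 4: 2.0},
}
print(reallocate(profiles, 4))  # → {'a': 3, 'b': 1}
```

In the actual framework, the reallocation decision is then carried out non-intrusively by the SideCar process, which briefly stops the training process and restarts it with the new GPU count rather than terminating the job.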
Pages: 13
Related Papers
50 items total
  • [1] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
    Hu, Qinghao
    Zhang, Meng
    Sun, Peng
    Wen, Yonggang
    Zhang, Tianwei
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, VOL 2, ASPLOS 2023, 2023, : 457 - 472
  • [2] Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
    Cao, Jiamin
    Guan, Yu
    Qian, Kun
    Gao, Jiaqi
    Xiao, Wencong
    Dong, Jianbo
    Fu, Binzhang
    Cai, Dennis
    Zhai, Ennan
    PROCEEDINGS OF THE 2024 ACM SIGCOMM 2024 CONFERENCE, ACM SIGCOMM 2024, 2024, : 1 - 15
  • [3] Deep Learning Application to Non-Intrusive Load Monitoring
    Nguyen Viet Linh
    Arboleya, Pablo
    2019 IEEE MILAN POWERTECH, 2019,
  • [4] Non-Intrusive Scheduling of TCP Flows
    Ayesta, U.
    Bertaux, L.
    Carvin, D.
    2015 IFIP NETWORKING CONFERENCE (IFIP NETWORKING), 2015,
  • [5] Non-intrusive model combination for learning dynamical systems
    Wu, Shiqi
    Chamoin, Ludovic
    Li, Qianxiao
    PHYSICA D-NONLINEAR PHENOMENA, 2024, 463
  • [6] Non-Intrusive Load Monitoring Based on an Efficient Deep Learning Model With Local Feature Extraction
    Zhou, Kaile
    Zhang, Zhiyue
    Lu, Xinhui
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (07) : 9497 - 9507
  • [7] Efficient Multi-Training Framework of Image Deep Learning on GPU Cluster
    Chen, Chun-Fu
    Lee, Gwo Giun
    Xia, Yinglong
    Lin, W. Sabrina
    Suzumura, Toyotaro
    Lin, Ching-Yung
    2015 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2015, : 489 - 494
  • [8] Non-Intrusive A/C Load Disaggregation Using Deep Learning
    Cho, Jin
    Hu, Zhen
    Sartipi, Mina
    2018 IEEE/PES TRANSMISSION AND DISTRIBUTION CONFERENCE AND EXPOSITION (T&D), 2018,
  • [9] A Non-Intrusive Deep Learning Based Diagnosis System for Elevators
    Chai, Songjian
    Li, Xuran Ivan
    Jia, Youwei
    He, Yufei
    Yip, Chi Ho
    Cheung, Ka Kei
    Wang, Minghao
    IEEE ACCESS, 2021, 9 : 20993 - 21003
  • [10] An efficient non-intrusive checkpointing algorithm for distributed database systems
    Wu, Jiang
    Manivarman, D.
    DISTRIBUTED COMPUTING AND NETWORKING, PROCEEDINGS, 2006, 4308 : 82 - 87