Elastic scheduler: Heterogeneous and dynamic deep learning in the cloud

Cited by: 0
Authors
Yin, Lujia [1 ]
Zhang, Yiming [1 ]
Peng, Yuxing [1 ]
Li, Dongsheng [1 ]
Affiliations
[1] Natl Univ Def Technol, Changsha 410005, Hunan, Peoples R China
Keywords
deep learning; dynamic training; elastic scheduler; heterogeneous training
DOI
10.1002/cpe.6206
Chinese Library Classification (CLC)
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
GPUs and CPUs are widely used for deep learning (DL) model training in the cloud, where both DL workloads and resource usage may change heavily over time. Traditional training methods require the type (either GPUs or CPUs) and number of computing devices to be specified beforehand, and thus cannot elastically schedule dynamic DL workloads onto the available GPUs/CPUs. In this paper, we propose Elastic Scheduler (ES), a novel approach that efficiently supports both heterogeneous training (with different device types) and dynamic training (with varying device numbers). ES (i) accumulates local gradients to simulate multiple virtual workers on one GPU, alleviating the performance gap between GPUs and CPUs so that heterogeneous GPU-CPU hybrid training achieves accuracy similar to homogeneous training, and (ii) uses local gradients to stabilize batch sizes, achieving high accuracy without lengthy compensation. Experiments show that ES achieves significantly higher performance than existing methods for heterogeneous and dynamic training as well as inference.
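The core mechanism named in the abstract, accumulating local gradients so that one fast device emulates several slower "virtual workers", can be illustrated with a minimal PyTorch sketch. This is a generic gradient-accumulation example, not the authors' Elastic Scheduler implementation; the model, data, and hyperparameters below are placeholders.

    import torch
    import torch.nn as nn

    # One device processes several micro-batches and accumulates their gradients
    # before a single parameter update, acting like several "virtual workers"
    # while the effective batch size stays fixed.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(32, 10).to(device)                      # stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    virtual_workers = 4    # micro-batches accumulated per update (placeholder)
    micro_batch = 16

    for step in range(100):
        optimizer.zero_grad()
        for _ in range(virtual_workers):
            x = torch.randn(micro_batch, 32, device=device)            # dummy inputs
            y = torch.randint(0, 10, (micro_batch,), device=device)    # dummy labels
            loss = loss_fn(model(x), y) / virtual_workers              # average over micro-batches
            loss.backward()    # gradients accumulate in parameter .grad buffers
        optimizer.step()       # one update per effective (accumulated) batch

Running the same loop with virtual_workers = 1 corresponds to an ordinary worker, which is, roughly, how a slower CPU worker and a gradient-accumulating GPU could take part in the same synchronous update with comparable per-worker batch sizes.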
Pages: 13
Related Papers (50 in total)
  • [1] Dynamic Scheduler Management Using Deep Learning
    Hall, James
    Moessner, Klaus
    Mackenzie, Richard
    Carrez, Francois
    Foh, Chuan Heng
    IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, 2020, 6 (02) : 575 - 585
  • [2] To cloud or not to cloud: an on-line scheduler for dynamic privacy-protection of deep learning workload on edge devices
    Tang, Yibin
    Wang, Ying
    Li, Huawei
    Li, Xiaowei
    CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2021, 3 (01) : 85 - 100
  • [4] Cluster Scheduler on Heterogeneous Cloud
    Ling, Xiao
    Yang, Jiahai
    Wang, Dan
    Wang, Ye
    2015 IEEE 17TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, 2015 IEEE 7TH INTERNATIONAL SYMPOSIUM ON CYBERSPACE SAFETY AND SECURITY, AND 2015 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS (ICESS), 2015, : 772 - 777
  • [5] GADaM: Generic Adaptive Deep-learning-based Multipath Scheduler Selector for Dynamic Heterogeneous Environment
    Chu, Tran-Tuan
    Labiod, Mohamed Aymen
    Tran, Hai-Anh
    Mellouk, Abdelhamid
    IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC 2022), 2022, : 4908 - 4913
  • [6] An Adaptive Cloud Bursting Job Scheduler based on Deep Reinforcement Learning
    Yasuda, Seiju
    Lee, Chonho
    Date, Susumu
    2021 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE BIG DATA AND INTELLIGENT SYSTEMS (HPBD&IS), 2021, : 217 - 224
  • [7] A Cloud QoS-driven Scheduler based on Deep Reinforcement Learning
    Minh-Ngoc Tran
    Kim, Younghan
    12TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2021): BEYOND THE PANDEMIC ERA WITH ICT CONVERGENCE INNOVATION, 2021, : 1823 - 1825
  • [8] Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
    Peng, Yanghua
    Bao, Yixin
    Chen, Yangrui
    Wu, Chuan
    Guo, Chuanxiong
    EUROSYS '18: PROCEEDINGS OF THE THIRTEENTH EUROSYS CONFERENCE, 2018,
  • [9] A Dynamic MapReduce Scheduler for Heterogeneous Workloads
    Tian, Chao
    Zhou, Haojie
    He, Yongqiang
    Zha, Li
    2009 EIGHTH INTERNATIONAL CONFERENCE ON GRID AND COOPERATIVE COMPUTING, PROCEEDINGS, 2009, : 218 - 224
  • [10] An Energy and Temperature Aware Deep Reinforcement Learning Workflow Scheduler in Cloud Computing
    Sudheer Mangalampalli, S.
    Reddy Karri, Ganesh
    Reddy Ch, Pradeep
    Sree Pokkuluri, Kiran
    Chakrabarti, Prasun
    Chakrabarti, Tulika
    IEEE Access, 2024, 12 : 163424 - 163443