SCHED2: Scheduling Deep Learning Training via Deep Reinforcement Learning

Cited by: 0
Authors
Luan, Yunteng [1 ]
Chen, Xukun [1 ]
Zhao, Hanyu [1 ]
Yang, Zhi [1 ]
Dai, Yafei [1 ]
Affiliations
[1] Peking Univ, Comp Sci Dept, Beijing, Peoples R China
Keywords
DOI
None available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Today's companies and organizations build GPU clusters for efficient deep learning training (DLT). However, the inherent heterogeneity of DLT workloads makes it challenging to schedule the GPUs efficiently. On one hand, DLT jobs typically exhibit diverse performance sensitivity to GPU locality; the scheduler should allocate GPUs with an appropriate degree of locality for better performance and utilization. On the other hand, DLT jobs are also diverse in size and duration, which can lead to severe cluster fragmentation and a lower chance of finding GPUs with good locality. In this paper, we present SCHED2, a GPU cluster scheduler that leverages deep reinforcement learning (DRL) to perform smart locality-aware scheduling of DLT jobs. This is achieved by a novel design that captures both jobs' locality-sensitivity and the cluster's fragmentation condition across the whole learning stack, i.e., from the job and cluster state definitions to the neural network architecture. Through this awareness, the DRL model can adjust its scheduling decisions dynamically and adaptively, reacting to individual jobs' differing locality-sensitivity and the changing level of cluster fragmentation. Experiments with realistic workloads demonstrate that SCHED2 reduces average job completion time (JCT) by 4.6x and makespan by 2.1x compared to heuristic-based schedulers.
Pages: 7
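As a rough illustration of the state described in the abstract, and not the authors' implementation, the sketch below encodes a job's locality sensitivity together with a cluster fragmentation summary (a histogram of free GPUs per node) and scores candidate placements with a toy linear policy. All names (GPUS_PER_NODE, cluster_state, job_state, score_placements) and the scoring rule are hypothetical assumptions standing in for SCHED2's actual state definitions and neural network.

```python
# Minimal, hypothetical sketch of a locality- and fragmentation-aware
# scheduling state, plus a toy linear "policy" that a DRL model would replace.
import numpy as np

GPUS_PER_NODE = 8  # assumed node size; not specified in the abstract

def cluster_state(free_gpus_per_node):
    """Fragmentation summary: normalized histogram of free GPUs per node."""
    hist = np.bincount(free_gpus_per_node, minlength=GPUS_PER_NODE + 1)
    return hist / len(free_gpus_per_node)

def job_state(gpus_requested, locality_sensitivity):
    """Job summary: normalized request size plus a locality-sensitivity score."""
    return np.array([gpus_requested / GPUS_PER_NODE, locality_sensitivity])

def score_placements(job_vec, cluster_vec, candidate_node_counts, weights):
    """Toy linear stand-in for a policy network: scores each candidate
    placement by how many nodes the job would be spread across."""
    scores = []
    for n_nodes in candidate_node_counts:
        x = np.concatenate([job_vec, cluster_vec, [float(n_nodes)]])
        scores.append(float(weights @ x))
    return np.array(scores)

if __name__ == "__main__":
    free = np.array([8, 3, 1, 0])                 # free GPUs on four nodes
    job = job_state(gpus_requested=4, locality_sensitivity=0.9)
    clus = cluster_state(free)
    rng = np.random.default_rng(0)
    weights = rng.normal(size=job.size + clus.size + 1)  # untrained weights
    print(score_placements(job, clus, candidate_node_counts=[1, 2, 4],
                           weights=weights))
```

In this sketch the placement options differ only in how many nodes the job spans; a trained policy would learn to trade off locality (fewer nodes) against the fragmentation it leaves behind, which is the behavior the abstract attributes to SCHED2.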