SCHED2: Scheduling Deep Learning Training via Deep Reinforcement Learning

Cited by: 0
Authors
Luan, Yunteng [1 ]
Chen, Xukun [1 ]
Zhao, Hanyu [1 ]
Yang, Zhi [1 ]
Dai, Yafei [1 ]
Affiliations
[1] Peking Univ, Comp Sci Dept, Beijing, Peoples R China
Keywords
DOI
None available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Today's companies and organizations build GPU clusters for efficient deep learning training (DLT). However, the inherent heterogeneity of DLT workloads makes it challenging to schedule the GPUs efficiently. On one hand, DLT jobs typically exhibit diverse performance sensitivity to GPU locality; the scheduler should allocate GPUs with an appropriate degree of locality for better performance and utilization. On the other hand, DLT jobs are also diverse in size and duration, which can lead to severe cluster fragmentation and a lower chance of finding GPUs with good locality. In this paper, we present SCHED2, a GPU cluster scheduler that leverages deep reinforcement learning (DRL) to perform smart locality-aware scheduling of DLT jobs. This is achieved by a novel design that captures both jobs' locality-sensitivity and the cluster's fragmentation condition across the whole learning stack, i.e., from the job and cluster state definitions to the neural network architecture. Through this awareness, the DRL model can adjust its scheduling decisions dynamically and adaptively, reacting to individual jobs' differing locality-sensitivity and the changing level of cluster fragmentation. Experiments with realistic workloads demonstrate that SCHED2 reduces average job completion time (JCT) by 4.6x and makespan by 2.1x compared to heuristic-based schedulers.
Pages: 7
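As a rough illustration of the state described in the abstract, and not the authors' implementation, the sketch below encodes a job's locality sensitivity together with a cluster fragmentation summary (a histogram of free GPUs per node) and scores candidate placements with a toy linear policy. All names (GPUS_PER_NODE, cluster_state, job_state, score_placements) and the scoring rule are hypothetical assumptions standing in for SCHED2's actual state definitions and neural network.

```python
# Minimal, hypothetical sketch of a locality- and fragmentation-aware
# scheduling state, plus a toy linear "policy" that a DRL model would replace.
import numpy as np

GPUS_PER_NODE = 8  # assumed node size; not specified in the abstract

def cluster_state(free_gpus_per_node):
    """Fragmentation summary: normalized histogram of free GPUs per node."""
    hist = np.bincount(free_gpus_per_node, minlength=GPUS_PER_NODE + 1)
    return hist / len(free_gpus_per_node)

def job_state(gpus_requested, locality_sensitivity):
    """Job summary: normalized request size plus a locality-sensitivity score."""
    return np.array([gpus_requested / GPUS_PER_NODE, locality_sensitivity])

def score_placements(job_vec, cluster_vec, candidate_node_counts, weights):
    """Toy linear stand-in for a policy network: scores each candidate
    placement by how many nodes the job would be spread across."""
    scores = []
    for n_nodes in candidate_node_counts:
        x = np.concatenate([job_vec, cluster_vec, [float(n_nodes)]])
        scores.append(float(weights @ x))
    return np.array(scores)

if __name__ == "__main__":
    free = np.array([8, 3, 1, 0])                 # free GPUs on four nodes
    job = job_state(gpus_requested=4, locality_sensitivity=0.9)
    clus = cluster_state(free)
    rng = np.random.default_rng(0)
    weights = rng.normal(size=job.size + clus.size + 1)  # untrained weights
    print(score_placements(job, clus, candidate_node_counts=[1, 2, 4],
                           weights=weights))
```

In this sketch the placement options differ only in how many nodes the job spans; a trained policy would learn to trade off locality (fewer nodes) against the fragmentation it leaves behind, which is the behavior the abstract attributes to SCHED2.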