Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters

Cited by: 3
Authors
Chen, Zhaoyun [1 ]
Luo, Lei [1 ]
Quan, Wei [1 ]
Wen, Mei [1 ]
Zhang, Chunyuan [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp, Changsha, Peoples R China
Keywords
DL platform; Reinforcement Learning; Scheduling; GPU clusters;
DOI
10.1109/infcomw.2019.8845276
CLC Classification Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808 ; 0809 ;
Abstract
With the recent widespread adoption of deep learning (DL) in academia and industry, DL platforms, which support the research and development (R&D) of AI firms, institutes, and universities, have attracted growing attention. For off-the-shelf distributed GPU clusters, prior work proposes prediction-based schedulers to allocate resources for diverse DL workloads. However, prediction-based schedulers suffer from limited prediction accuracy and high offline-profiling costs. In this paper, we propose a learning-based scheduler that models the scheduling problem as a reinforcement learning problem, aiming to minimize average job completion time and maximize system utilization. The scheduler comprises the designs of the state space, action space, reward function, and update scheme. Furthermore, we will evaluate our proposed scheduler, implemented as a TensorFlow plugin, on a real cluster and in large-scale simulation.
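The abstract names the components of an RL scheduler (state space, action space, reward function, update scheme) without giving their concrete design. As an illustration only, the sketch below shows how such a formulation might look with tabular Q-learning on a toy cluster: the state, reward, and all parameters (`NUM_GPUS`, `JOB_DURATION`, the wait/dispatch action set) are assumptions made for this example, not the design from the poster.

```python
import random

random.seed(0)

# Toy cluster: NUM_GPUS identical GPUs; every job needs one GPU for
# JOB_DURATION steps. These constants are illustrative assumptions.
NUM_GPUS = 2
JOB_DURATION = 3
ACTIONS = [0, 1]  # 0 = wait, 1 = dispatch the head-of-queue job


def step(queue_len, busy, action):
    """Advance the simulated cluster by one time step.

    queue_len -- number of jobs still waiting
    busy      -- list of remaining run times on occupied GPUs
    Returns (new_queue_len, new_busy, reward).
    """
    if action == 1 and queue_len > 0 and len(busy) < NUM_GPUS:
        queue_len -= 1
        busy = busy + [JOB_DURATION]
    busy = [t - 1 for t in busy if t - 1 > 0]  # running jobs progress
    # Illustrative reward: penalize waiting jobs (a proxy for job
    # completion time) and reward GPU utilization.
    reward = -queue_len + len(busy)
    return queue_len, busy, reward


# Tabular Q-learning over the coarse state (queue length, GPUs in use).
Q = {}
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1


def choose(state):
    """Epsilon-greedy action selection."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))


for episode in range(500):
    queue_len, busy = 5, []  # each episode starts with 5 queued jobs
    for _ in range(30):
        state = (queue_len, len(busy))
        action = choose(state)
        queue_len, busy, reward = step(queue_len, busy, action)
        nxt = (queue_len, len(busy))
        best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        # Standard Q-learning update rule.
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```

Under this reward, the learned greedy policy dispatches a job whenever one is waiting and a GPU is free. A practical scheduler would replace the table with a policy or value network and a far richer state (per-job resource requests, locality, queue history), which is presumably what the poster's design covers.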
Pages: 1023 - 1024
Number of pages: 2
Related Papers
50 records in total
  • [1] Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters
    Bian, Zhengda
    Li, Shenggui
    Wang, Wei
    You, Yang
    [J]. SC21: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2021,
  • [2] Reliability of Large Scale GPU Clusters for Deep Learning Workloads
    Qian, Junjie
    Kim, Taeyoon
    Jeon, Myeongjae
    [J]. WEB CONFERENCE 2021: COMPANION OF THE WORLD WIDE WEB CONFERENCE (WWW 2021), 2021, : 179 - 181
  • [3] DASH: Scheduling Deep Learning Workloads on Multi-Generational GPU-Accelerated Clusters
    Li, Baolin
    Patel, Tirthak
    Gadepally, Vijay
    Gettings, Karen
    Samsi, Siddharth
    Tiwari, Devesh
    [J]. 2022 IEEE HIGH PERFORMANCE EXTREME COMPUTING VIRTUAL CONFERENCE (HPEC), 2022,
  • [4] Poster Abstract: Smart Irrigation Control Using Deep Reinforcement Learning
    Ding, Xianzhong
    Du, Wan
    [J]. 2022 21ST ACM/IEEE INTERNATIONAL CONFERENCE ON INFORMATION PROCESSING IN SENSOR NETWORKS (IPSN 2022), 2022, : 539 - 540
  • [5] Poster Abstract: Towards Adaptive Anomaly Detection in Buildings with Deep Reinforcement Learning
    Wu, Tong
    Ortiz, Jorge
    [J]. BUILDSYS'19: PROCEEDINGS OF THE 6TH ACM INTERNATIONAL CONFERENCE ON SYSTEMS FOR ENERGY-EFFICIENT BUILDINGS, CITIES, AND TRANSPORTATION, 2019, : 380 - 382
  • [6] Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters
    Hsieh, Tsung-Tso
    Lee, Che-Rung
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E, 2023, : 131 - 140
  • [7] Understanding of GPU Architectural Vulnerability for Deep Learning Workloads
    Santoso, Danny
    Jeon, Hyeran
    [J]. 2019 IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI AND NANOTECHNOLOGY SYSTEMS (DFT), 2019,
  • [8] Optimizing Deep Learning Workloads on ARM GPU with TVM
    Zheng, Lianmin
    Chen, Tianqi
    [J]. 1ST ACM REQUEST WORKSHOP/TOURNAMENT ON REPRODUCIBLE SOFTWARE/HARDWARE CO-DESIGN OF PARETO-EFFICIENT DEEP LEARNING, 2018,
  • [9] Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
    Awan, Ammar Ahmad
    Subramoni, Hari
    Chu, Ching-Hsiang
    Panda, Dhabaleswar K.
    [J]. EUROMPI 2018: PROCEEDINGS OF THE 25TH EUROPEAN MPI USERS' GROUP MEETING, 2018,
  • [10] Evaluating On-Node GPU Interconnects for Deep Learning Workloads
    Tallent, Nathan R.
    Gawande, Nitin A.
    Siegel, Charles
    Vishnu, Abhinav
    Hoisie, Adolfy
    [J]. HIGH PERFORMANCE COMPUTING SYSTEMS: PERFORMANCE MODELING, BENCHMARKING, AND SIMULATION (PMBS 2017), 2018, 10724 : 3 - 21