Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters

Cited by: 3
Authors
Chen, Zhaoyun [1 ]
Luo, Lei [1 ]
Quan, Wei [1 ]
Wen, Mei [1 ]
Zhang, Chunyuan [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Comp, Changsha, Peoples R China
Keywords
DL platform; Reinforcement Learning; Scheduling; GPU clusters;
DOI
10.1109/infcomw.2019.8845276
CLC Classification Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808 ; 0809 ;
Abstract
With the recent widespread adoption of deep learning (DL) in academia and industry, DL platforms, which support the research and development (R&D) of AI firms, institutes, and universities, have attracted growing attention. For off-the-shelf distributed GPU clusters, prior work proposes prediction-based schedulers to allocate resources for diverse DL workloads. However, prediction-based schedulers suffer from limited prediction accuracy and high offline-profiling costs. In this paper, we propose a learning-based scheduler that models the scheduling problem as a reinforcement learning problem, aiming to minimize average job completion time and maximize system utilization. The scheduler comprises the designs of the state space, action space, reward function, and update scheme. Furthermore, we will evaluate our proposed scheduler, implemented as a TensorFlow plugin, on a real cluster and in large-scale simulation.
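The abstract names the components of an RL scheduler (state space, action space, reward function, update scheme) without giving their concrete design. As an illustration only, the sketch below shows how such a formulation might look with tabular Q-learning on a toy cluster: the state, reward, and all parameters (`NUM_GPUS`, `JOB_DURATION`, the wait/dispatch action set) are assumptions made for this example, not the design from the poster.

```python
import random

random.seed(0)

# Toy cluster: NUM_GPUS identical GPUs; every job needs one GPU for
# JOB_DURATION steps. These constants are illustrative assumptions.
NUM_GPUS = 2
JOB_DURATION = 3
ACTIONS = [0, 1]  # 0 = wait, 1 = dispatch the head-of-queue job


def step(queue_len, busy, action):
    """Advance the simulated cluster by one time step.

    queue_len -- number of jobs still waiting
    busy      -- list of remaining run times on occupied GPUs
    Returns (new_queue_len, new_busy, reward).
    """
    if action == 1 and queue_len > 0 and len(busy) < NUM_GPUS:
        queue_len -= 1
        busy = busy + [JOB_DURATION]
    busy = [t - 1 for t in busy if t - 1 > 0]  # running jobs progress
    # Illustrative reward: penalize waiting jobs (a proxy for job
    # completion time) and reward GPU utilization.
    reward = -queue_len + len(busy)
    return queue_len, busy, reward


# Tabular Q-learning over the coarse state (queue length, GPUs in use).
Q = {}
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1


def choose(state):
    """Epsilon-greedy action selection."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))


for episode in range(500):
    queue_len, busy = 5, []  # each episode starts with 5 queued jobs
    for _ in range(30):
        state = (queue_len, len(busy))
        action = choose(state)
        queue_len, busy, reward = step(queue_len, busy, action)
        nxt = (queue_len, len(busy))
        best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        # Standard Q-learning update rule.
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```

Under this reward, the learned greedy policy dispatches a job whenever one is waiting and a GPU is free. A practical scheduler would replace the table with a policy or value network and a far richer state (per-job resource requests, locality, queue history), which is presumably what the poster's design covers.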
Pages: 1023 - 1024
Number of pages: 2
Related Papers
50 records in total
  • [1] Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters
    Bian, Zhengda
    Li, Shenggui
    Wang, Wei
    You, Yang
    [J]. SC21: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2021,
  • [2] Reliability of Large Scale GPU Clusters for Deep Learning Workloads
    Qian, Junjie
    Kim, Taeyoon
    Jeon, Myeongjae
    [J]. WEB CONFERENCE 2021: COMPANION OF THE WORLD WIDE WEB CONFERENCE (WWW 2021), 2021, : 179 - 181
  • [3] DASH: Scheduling Deep Learning Workloads on Multi-Generational GPU-Accelerated Clusters
    Li, Baolin
    Patel, Tirthak
    Gadepally, Vijay
    Gettings, Karen
    Samsi, Siddharth
    Tiwari, Devesh
    [J]. 2022 IEEE HIGH PERFORMANCE EXTREME COMPUTING VIRTUAL CONFERENCE (HPEC), 2022,
  • [4] Poster Abstract: Smart Irrigation Control Using Deep Reinforcement Learning
    Ding, Xianzhong
    Du, Wan
    [J]. 2022 21ST ACM/IEEE INTERNATIONAL CONFERENCE ON INFORMATION PROCESSING IN SENSOR NETWORKS (IPSN 2022), 2022, : 539 - 540
  • [5] Poster Abstract: Towards Adaptive Anomaly Detection in Buildings with Deep Reinforcement Learning
    Wu, Tong
    Ortiz, Jorge
    [J]. BUILDSYS'19: PROCEEDINGS OF THE 6TH ACM INTERNATIONAL CONFERENCE ON SYSTEMS FOR ENERGY-EFFICIENT BUILDINGS, CITIES, AND TRANSPORTATION, 2019, : 380 - 382
  • [6] Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters
    Hsieh, Tsung-Tso
    Lee, Che-Rung
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E, 2023, : 131 - 140
  • [7] Understanding of GPU Architectural Vulnerability for Deep Learning Workloads
    Santoso, Danny
    Jeon, Hyeran
    [J]. 2019 IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI AND NANOTECHNOLOGY SYSTEMS (DFT), 2019,
  • [8] Optimizing Deep Learning Workloads on ARM GPU with TVM
    Zheng, Lianmin
    Chen, Tianqi
    [J]. 1ST ACM REQUEST WORKSHOP/TOURNAMENT ON REPRODUCIBLE SOFTWARE/HARDWARE CO-DESIGN OF PARETO-EFFICIENT DEEP LEARNING, 2018,
  • [9] Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
    Awan, Ammar Ahmad
    Subramoni, Hari
    Chu, Ching-Hsiang
    Panda, Dhabaleswar K.
    [J]. EUROMPI 2018: PROCEEDINGS OF THE 25TH EUROPEAN MPI USERS' GROUP MEETING, 2018,
  • [10] Evaluating On-Node GPU Interconnects for Deep Learning Workloads
    Tallent, Nathan R.
    Gawande, Nitin A.
    Siegel, Charles
    Vishnu, Abhinav
    Hoisie, Adolfy
    [J]. HIGH PERFORMANCE COMPUTING SYSTEMS: PERFORMANCE MODELING, BENCHMARKING, AND SIMULATION (PMBS 2017), 2018, 10724 : 3 - 21