Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments

Cited by: 34
Authors
Amaral, Marcelo [1 ]
Polo, Jorda [2 ]
Carrera, David [1 ]
Seelam, Seetharami [3 ]
Steinder, Malgorzata [3 ]
Affiliations
[1] Univ Politecn Cataluna, Barcelona Supercomp Ctr, Barcelona, Spain
[2] Barcelona Supercomp Ctr, Barcelona, Spain
[3] IBM Watson Res Ctr, Yorktown Hts, NY USA
Funding
European Research Council
Keywords
Scheduling; Placement; GPU; Multi-GPU; Performance Analysis; Resource Contention; Workload Interference; Deep Learning
DOI
10.1145/3126908.3126933
Chinese Library Classification
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and the Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to approximately 1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.
Pages: 12
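
The abstract describes the approach only at a high level; the sketch below illustrates the general idea of topology-aware GPU selection on a single multi-GPU node. It is not the paper's algorithm: the link-bandwidth matrix, the job model (a requested GPU count scored by worst-case pairwise bandwidth), and the exhaustive enumeration over free GPUs are illustrative assumptions.

# Minimal sketch (not the paper's algorithm): topology-aware GPU selection on a
# single multi-GPU node. The link-bandwidth values and the scoring rule are
# illustrative assumptions.
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical pairwise link bandwidth (GB/s) between the four GPUs of one node:
# NVLink-connected pairs are fast, same-socket PCIe is slower, cross-socket
# paths are slowest.
LINK_BW: Dict[Tuple[int, int], float] = {
    (0, 1): 80.0, (2, 3): 80.0,   # NVLink pairs
    (0, 2): 32.0, (1, 3): 32.0,   # same-socket PCIe
    (0, 3): 16.0, (1, 2): 16.0,   # cross-socket paths
}

def pair_bw(a: int, b: int) -> float:
    """Bandwidth of the link between two GPUs, regardless of argument order."""
    return LINK_BW[(min(a, b), max(a, b))]

def place_job(num_gpus: int, free_gpus: List[int]) -> List[int]:
    """Pick num_gpus free GPUs that maximize the worst-case pairwise bandwidth,
    so communication-heavy jobs land on tightly connected GPU sets."""
    if num_gpus == 1:
        return free_gpus[:1]
    best, best_score = [], -1.0
    for combo in combinations(free_gpus, num_gpus):
        score = min(pair_bw(a, b) for a, b in combinations(combo, 2))
        if score > best_score:
            best, best_score = list(combo), score
    return best

if __name__ == "__main__":
    print(place_job(2, free_gpus=[0, 1, 3]))  # -> [0, 1], the NVLink pair
    print(place_job(2, free_gpus=[0, 3]))     # -> [0, 3], the only option left

In this toy model, a two-GPU job offered GPUs {0, 1, 3} is placed on the NVLink-connected pair {0, 1}. A scheduler like the one evaluated in the paper would additionally account for interference, i.e., avoid placements that share links or sockets with communication-heavy jobs already running.
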
Related Papers
50 items in total
  • [1] Topology-Aware Scheduling Framework for Microservice Applications in Cloud
    Li, Xin
    Zhou, Junsong
    Wei, Xin
    Li, Dawei
    Qian, Zhuzhong
    Wu, Jie
    Qin, Xiaolin
    Lu, Sanglu
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (05) : 1635 - 1649
  • [2] Topology-Aware Job Scheduling for Machine Learning Cluster
    Lu, Jingyuan
    Li, Peng
    Wang, Kun
    Feng, Huibin
    Guo, Enting
    Wang, Xiaoyan
    Guo, Song
    [J]. 2019 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2019,
  • [3] Topology-Aware OpenMP Process Scheduling
    Thoman, Peter
    Moritsch, Hans
    Fahringer, Thomas
    [J]. BEYOND LOOP LEVEL PARALLELISM IN OPENMP: ACCELERATORS, TASKING AND MORE, PROCEEDINGS, 2010, 6132 : 96 - 108
  • [4] Topology-Aware GPU Selection on Multi-GPU Nodes
    Faraji, Iman
    Mirsadeghi, Seyed H.
    Afsahi, Ahmad
    [J]. 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 712 - 720
  • [5] Topology-Aware Resource Allocation for Data-Intensive Workloads
    Lee, Gunho
    Tolia, Niraj
    Ranganathan, Parthasarathy
    Katz, Randy H.
    [J]. ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2011, 41 (01) : 120 - 124
  • [6] A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters
    Lin, Zheyu
    Chen, Xukun
    Zhao, Hanyu
    Luan, Yunteng
    Yang, Zhi
    Dai, Yafei
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 2795 - 2801
  • [7] Deep Reinforcement Learning for Topology-Aware VNF Resource Prediction in NFV Environments
    Jalodia, Nikita
    Henna, Shagufta
    Davy, Alan
    [J]. 2019 IEEE CONFERENCE ON NETWORK FUNCTION VIRTUALIZATION AND SOFTWARE DEFINED NETWORKS (IEEE NFV-SDN), 2019,
  • [8] Effects of Topology-Aware Allocation Policies on Scheduling Performance
    Antonio Pascual, Jose
    Navaridas, Javier
    Miguel-Alonso, Jose
    [J]. JOB SCHEDULING STRATEGIES FOR PARALLEL PROCESSING, 2009, 5798 : 138 - 156
  • [9] A topology-aware method for scientific application deployment on cloud
    Fan, Pei
    Chen, Zhenbang
    Wang, Ji
    Zheng, Zibin
    Lyu, Michael R.
    [J]. INTERNATIONAL JOURNAL OF WEB AND GRID SERVICES, 2014, 10 (04) : 338 - 370
  • [10] Using topology-aware communication services in grid environments