Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments

Cited by: 34
Authors
Amaral, Marcelo [1 ]
Polo, Jorda [2 ]
Carrera, David [1 ]
Seelam, Seetharami [3 ]
Steinder, Malgorzata [3 ]
Affiliations
[1] Univ Politecn Cataluna, Barcelona Supercomp Ctr, Barcelona, Spain
[2] Barcelona Supercomp Ctr, Barcelona, Spain
[3] IBM Watson Res Ctr, Yorktown Hts, NY USA
Funding
European Research Council
Keywords
Scheduling; Placement; GPU; Multi-GPU; Performance Analysis; Resource Contention; Workload Interference; Deep Learning
DOI
10.1145/3126908.3126933
Chinese Library Classification
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and the Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to approximately 1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.
Pages: 12
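
The abstract describes the approach only at a high level; the sketch below illustrates the general idea of topology-aware GPU selection on a single multi-GPU node. It is not the paper's algorithm: the link-bandwidth matrix, the job model (a requested GPU count scored by worst-case pairwise bandwidth), and the exhaustive enumeration over free GPUs are illustrative assumptions.

# Minimal sketch (not the paper's algorithm): topology-aware GPU selection on a
# single multi-GPU node. The link-bandwidth values and the scoring rule are
# illustrative assumptions.
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical pairwise link bandwidth (GB/s) between the four GPUs of one node:
# NVLink-connected pairs are fast, same-socket PCIe is slower, cross-socket
# paths are slowest.
LINK_BW: Dict[Tuple[int, int], float] = {
    (0, 1): 80.0, (2, 3): 80.0,   # NVLink pairs
    (0, 2): 32.0, (1, 3): 32.0,   # same-socket PCIe
    (0, 3): 16.0, (1, 2): 16.0,   # cross-socket paths
}

def pair_bw(a: int, b: int) -> float:
    """Bandwidth of the link between two GPUs, regardless of argument order."""
    return LINK_BW[(min(a, b), max(a, b))]

def place_job(num_gpus: int, free_gpus: List[int]) -> List[int]:
    """Pick num_gpus free GPUs that maximize the worst-case pairwise bandwidth,
    so communication-heavy jobs land on tightly connected GPU sets."""
    if num_gpus == 1:
        return free_gpus[:1]
    best, best_score = [], -1.0
    for combo in combinations(free_gpus, num_gpus):
        score = min(pair_bw(a, b) for a, b in combinations(combo, 2))
        if score > best_score:
            best, best_score = list(combo), score
    return best

if __name__ == "__main__":
    print(place_job(2, free_gpus=[0, 1, 3]))  # -> [0, 1], the NVLink pair
    print(place_job(2, free_gpus=[0, 3]))     # -> [0, 3], the only option left

In this toy model, a two-GPU job offered GPUs {0, 1, 3} is placed on the NVLink-connected pair {0, 1}. A scheduler like the one evaluated in the paper would additionally account for interference, i.e., avoid placements that share links or sockets with communication-heavy jobs already running.
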
Related Papers
50 items in total
  • [1] Topology-Aware Scheduling Framework for Microservice Applications in Cloud
    Li, Xin
    Zhou, Junsong
    Wei, Xin
    Li, Dawei
    Qian, Zhuzhong
    Wu, Jie
    Qin, Xiaolin
    Lu, Sanglu
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (05) : 1635 - 1649
  • [2] Topology-Aware Job Scheduling for Machine Learning Cluster
    Lu, Jingyuan
    Li, Peng
    Wang, Kun
    Feng, Huibin
    Guo, Enting
    Wang, Xiaoyan
    Guo, Song
    [J]. 2019 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2019,
  • [3] Topology-Aware OpenMP Process Scheduling
    Thoman, Peter
    Moritsch, Hans
    Fahringer, Thomas
    [J]. BEYOND LOOP LEVEL PARALLELISM IN OPENMP: ACCELERATORS, TASKING AND MORE, PROCEEDINGS, 2010, 6132 : 96 - 108
  • [4] Topology-Aware GPU Selection on Multi-GPU Nodes
    Faraji, Iman
    Mirsadeghi, Seyed H.
    Afsahi, Ahmad
    [J]. 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 712 - 720
  • [5] Topology-Aware Resource Allocation for Data-Intensive Workloads
    Lee, Gunho
    Tolia, Niraj
    Ranganathan, Parthasarathy
    Katz, Randy H.
    [J]. ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2011, 41 (01) : 120 - 124
  • [6] A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters
    Lin, Zheyu
    Chen, Xukun
    Zhao, Hanyu
    Luan, Yunteng
    Yang, Zhi
    Dai, Yafei
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 2795 - 2801
  • [7] Deep Reinforcement Learning for Topology-Aware VNF Resource Prediction in NFV Environments
    Jalodia, Nikita
    Henna, Shagufta
    Davy, Alan
    [J]. 2019 IEEE CONFERENCE ON NETWORK FUNCTION VIRTUALIZATION AND SOFTWARE DEFINED NETWORKS (IEEE NFV-SDN), 2019,
  • [8] Effects of Topology-Aware Allocation Policies on Scheduling Performance
    Antonio Pascual, Jose
    Navaridas, Javier
    Miguel-Alonso, Jose
    [J]. JOB SCHEDULING STRATEGIES FOR PARALLEL PROCESSING, 2009, 5798 : 138 - 156
  • [9] A topology-aware method for scientific application deployment on cloud
    Fan, Pei
    Chen, Zhenbang
    Wang, Ji
    Zheng, Zibin
    Lyu, Michael R.
    [J]. INTERNATIONAL JOURNAL OF WEB AND GRID SERVICES, 2014, 10 (04) : 338 - 370
  • [10] Using topology-aware communication services in grid environments