HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees

Cited by: 0
Authors
Zhao, Hanyu [1 ,3 ]
Han, Zhenhua [2 ,3 ]
Yang, Zhi [1 ]
Zhang, Quanlu [3 ]
Yang, Fan [3 ]
Zhou, Lidong [3 ]
Yang, Mao [3 ]
Lau, Francis C. M. [2 ]
Wang, Yuqi [3 ]
Xiong, Yifan [3 ]
Wang, Bin [3 ]
Affiliations
[1] Peking University, Beijing, China
[2] University of Hong Kong, Hong Kong, China
[3] Microsoft, Redmond, WA 98052, USA
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline code
081202; 0835;
Abstract
Deep learning training on a shared GPU cluster is becoming common practice. However, we observe a severe sharing anomaly in production multi-tenant clusters: jobs in some tenants experience worse queuing delay than they would in a private cluster with their allocated share of GPUs. This happens because tenants reserve resources with a quota, expressed as a number of GPUs, whereas deep learning jobs often need GPUs with a desirable GPU affinity, which a quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such an anomaly can never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures that correspond to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing scheduler within each VC to achieve its respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm that ensures sharing safety by efficiently managing the dynamic binding of cells from VCs to cells in the physical cluster. A straightforward extension of buddy cell allocation further supports low-priority jobs that scavenge unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) the sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving a separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety.
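The buddy cell allocation named in the abstract is analogous to classic buddy memory allocation: cells at coarser affinity levels are split on demand, and split cells merge back once all of their buddies become free. Below is a minimal illustrative sketch of that idea only, not HiveD's implementation; the names (Cell, BuddyAllocator, allocate, release, fanout) and the GPU/switch/node/rack hierarchy are assumptions made for illustration, and VCs, the dynamic VC-to-physical binding, and the low-priority extension are omitted.

```python
# A minimal sketch of buddy-style cell allocation over an assumed GPU-affinity
# hierarchy (GPU -> PCIe switch -> node -> rack). All names are hypothetical,
# not HiveD's actual API.

class Cell:
    def __init__(self, level, parent=None):
        self.level = level      # 1 = single GPU; higher levels = coarser affinity
        self.parent = parent    # the cell this one was split from, if any
        self.children = []      # cells produced by splitting this one


class BuddyAllocator:
    def __init__(self, top_level, top_cells, fanout):
        # fanout[k]: number of level-(k-1) cells a level-k cell splits into
        self.fanout = fanout
        self.free = {lvl: [] for lvl in range(1, top_level + 1)}
        self.free[top_level] = [Cell(top_level) for _ in range(top_cells)]

    def allocate(self, level):
        """Return a free cell at `level`, splitting a coarser cell if needed."""
        if self.free[level]:
            return self.free[level].pop()
        if level + 1 not in self.free:
            raise RuntimeError("no free cell at or above the requested level")
        parent = self.allocate(level + 1)
        parent.children = [Cell(level, parent) for _ in range(self.fanout[level + 1])]
        self.free[level].extend(parent.children[1:])   # keep one, free the buddies
        return parent.children[0]

    def release(self, cell):
        """Free a cell; merge it back into its parent once all buddies are free."""
        self.free[cell.level].append(cell)
        parent = cell.parent
        if parent and all(c in self.free[cell.level] for c in parent.children):
            for c in parent.children:
                self.free[cell.level].remove(c)
            parent.children = []
            self.release(parent)


# Example: one rack-level cell (level 4) with 4 nodes per rack,
# 2 PCIe switches per node, and 4 GPUs per switch.
alloc = BuddyAllocator(top_level=4, top_cells=1, fanout={2: 4, 3: 2, 4: 4})
node = alloc.allocate(3)    # splits the rack cell; 3 node cells stay free
gpu = alloc.allocate(1)     # splits a free node cell down to a single GPU
alloc.release(gpu)          # buddies merge back up into a whole node cell
alloc.release(node)         # everything merges back into the rack cell
```

The property the sketch preserves is the one the abstract relies on: requests are satisfied at the finest level possible and freed cells coalesce, so coarse-grained affinity (whole switches, nodes, racks) remains available for later allocations.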
Pages: 515 - 532
Page count: 18