HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees

Cited by: 0
Authors
Zhao, Hanyu [1 ,3 ]
Han, Zhenhua [2 ,3 ]
Yang, Zhi [1 ]
Zhang, Quanlu [3 ]
Yang, Fan [3 ]
Zhou, Lidong [3 ]
Yang, Mao [3 ]
Lau, Francis C. M. [2 ]
Wang, Yuqi [3 ]
Xiong, Yifan [3 ]
Wang, Bin [3 ]
Affiliations
[1] Peking University, Beijing, China
[2] University of Hong Kong, Hong Kong, China
[3] Microsoft, Redmond, WA 98052, USA
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline code
081202; 0835;
Abstract
Deep learning training on a shared GPU cluster is becoming common practice. However, we observe a severe sharing anomaly in production multi-tenant clusters: jobs in some tenants experience worse queuing delay than they would in a private cluster with their allocated share of GPUs. This happens because tenants reserve resources with a quota, expressed as a number of GPUs, whereas deep learning jobs often need GPUs with a desirable GPU affinity, which a quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such an anomaly can never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures that correspond to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing scheduler within each VC to achieve its respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm that ensures sharing safety by efficiently managing the dynamic binding of cells from VCs to cells in the physical cluster. A straightforward extension of buddy cell allocation further supports low-priority jobs that scavenge unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) the sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving a separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety.
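The buddy cell allocation named in the abstract is analogous to classic buddy memory allocation: cells at coarser affinity levels are split on demand, and split cells merge back once all of their buddies become free. Below is a minimal illustrative sketch of that idea only, not HiveD's implementation; the names (Cell, BuddyAllocator, allocate, release, fanout) and the GPU/switch/node/rack hierarchy are assumptions made for illustration, and VCs, the dynamic VC-to-physical binding, and the low-priority extension are omitted.

```python
# A minimal sketch of buddy-style cell allocation over an assumed GPU-affinity
# hierarchy (GPU -> PCIe switch -> node -> rack). All names are hypothetical,
# not HiveD's actual API.

class Cell:
    def __init__(self, level, parent=None):
        self.level = level      # 1 = single GPU; higher levels = coarser affinity
        self.parent = parent    # the cell this one was split from, if any
        self.children = []      # cells produced by splitting this one


class BuddyAllocator:
    def __init__(self, top_level, top_cells, fanout):
        # fanout[k]: number of level-(k-1) cells a level-k cell splits into
        self.fanout = fanout
        self.free = {lvl: [] for lvl in range(1, top_level + 1)}
        self.free[top_level] = [Cell(top_level) for _ in range(top_cells)]

    def allocate(self, level):
        """Return a free cell at `level`, splitting a coarser cell if needed."""
        if self.free[level]:
            return self.free[level].pop()
        if level + 1 not in self.free:
            raise RuntimeError("no free cell at or above the requested level")
        parent = self.allocate(level + 1)
        parent.children = [Cell(level, parent) for _ in range(self.fanout[level + 1])]
        self.free[level].extend(parent.children[1:])   # keep one, free the buddies
        return parent.children[0]

    def release(self, cell):
        """Free a cell; merge it back into its parent once all buddies are free."""
        self.free[cell.level].append(cell)
        parent = cell.parent
        if parent and all(c in self.free[cell.level] for c in parent.children):
            for c in parent.children:
                self.free[cell.level].remove(c)
            parent.children = []
            self.release(parent)


# Example: one rack-level cell (level 4) with 4 nodes per rack,
# 2 PCIe switches per node, and 4 GPUs per switch.
alloc = BuddyAllocator(top_level=4, top_cells=1, fanout={2: 4, 3: 2, 4: 4})
node = alloc.allocate(3)    # splits the rack cell; 3 node cells stay free
gpu = alloc.allocate(1)     # splits a free node cell down to a single GPU
alloc.release(gpu)          # buddies merge back up into a whole node cell
alloc.release(node)         # everything merges back into the rack cell
```

The property the sketch preserves is the one the abstract relies on: requests are satisfied at the finest level possible and freed cells coalesce, so coarse-grained affinity (whole switches, nodes, racks) remains available for later allocations.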
Pages: 515 - 532
Page count: 18