DxPU: Large-scale Disaggregated GPU Pools in the Datacenter

Cited: 0
Authors
He, Bowen [1 ,2 ]
Zheng, Xiao [2 ]
Chen, Yuan [1 ,2 ]
Li, Weinan [2 ]
Zhou, Yajin [1 ]
Long, Xin [2 ]
Zhang, Pengcheng [2 ]
Lu, Xiaowei [2 ]
Jiang, Linquan [2 ]
Liu, Qiang [2 ]
Cai, Dennis [2 ]
Zhang, Xiantao [2 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
Clouds; clusters; data centers;
DOI
10.1145/3617995
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
The rapid adoption of AI and the convenience offered by cloud services have resulted in growing demand for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed pairing of host servers and GPUs is extremely inefficient in terms of resource utilization, upgrade, and maintenance. To address these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers: it aggregates GPUs into a pool and allocates GPU nodes according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this article, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU nodes as users demand. To understand the performance overhead incurred by DxPU, we build a performance model for AI-specific workloads. Guided by the modeling results, we develop a prototype system, which has been deployed in the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10% compared with native GPU servers in most user scenarios.
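The abstract's core idea is the pool-and-allocate pattern: GPUs are detached from individual host servers, pooled at datacenter scale, and attached to tenants on demand. The sketch below is only a toy illustration of that pattern under assumed names; GpuPool, GpuNode, allocate, and release are hypothetical and do not reflect DxPU's actual implementation.

    # Toy sketch (not from the paper): a pool of disaggregated GPU nodes
    # that grants as many nodes as a tenant requests, if available.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class GpuNode:
        node_id: str
        allocated_to: Optional[str] = None   # tenant currently holding this node

    @dataclass
    class GpuPool:
        nodes: list = field(default_factory=list)

        def free_nodes(self):
            # Nodes not currently attached to any tenant.
            return [n for n in self.nodes if n.allocated_to is None]

        def allocate(self, tenant: str, count: int):
            # Attach `count` free GPU nodes to `tenant`, or fail without side effects.
            free = self.free_nodes()
            if len(free) < count:
                raise RuntimeError(f"only {len(free)} free nodes, {count} requested")
            granted = free[:count]
            for n in granted:
                n.allocated_to = tenant
            return granted

        def release(self, tenant: str):
            # Return all of a tenant's GPU nodes to the pool.
            for n in self.nodes:
                if n.allocated_to == tenant:
                    n.allocated_to = None

    # Usage: a host server requests 4 GPUs from a 16-node pool.
    pool = GpuPool([GpuNode(f"gpu-{i:02d}") for i in range(16)])
    vm_gpus = pool.allocate(tenant="vm-1234", count=4)
    print([n.node_id for n in vm_gpus])   # ['gpu-00', 'gpu-01', 'gpu-02', 'gpu-03']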
Pages: 23
相关论文
共 50 条
  • [21] SELF-PACED ECONOMICS INSTRUCTION - LARGE-SCALE DISAGGREGATED EVALUATION
    SOPER, JC
    THORNTON, RM
    JOURNAL OF ECONOMIC EDUCATION, 1976, 7 (02): : 81 - 91
  • [22] A Traffic Visualization Framework for Monitoring Large-scale Inter- DataCenter Network
    Elbaham, Meryem
    Nguyen, Kim Khoa
    Cheriet, Mohammed
    2016 12TH INTERNATIONAL CONFERENCE ON NETWORK AND SERVICE MANAGEMENT AND WORKSHOPS(CNSM 2016), 2016, : 277 - 281
  • [23] Large-Scale Optical Circuit Switch Architecture for Intra-Datacenter Networking
    Mori, Yojiro
    Sato, Ken-ichi
    2018 OPTICAL FIBER COMMUNICATIONS CONFERENCE AND EXPOSITION (OFC), 2018,
  • [24] A Virtual Network to Achieve Low Energy Consumption in Optical Large-scale Datacenter
    Tarutani, Yuya
    Ohsita, Yuichi
    Murata, Masayuki
    2012 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS (IEEE ICCS 2012), 2012, : 45 - 49
  • [25] Recent Progress in Large-scale Optical Switches for Intra-datacenter Interconnection
    Ueda, Koh
    Mori, Yojiro
    Hasegawa, Hiroshi
    Sato, Ken-Ichi
    2016 PROGRESS IN ELECTROMAGNETICS RESEARCH SYMPOSIUM (PIERS), 2016, : 2369 - 2373
  • [26] Throughput optimization of TCP incast congestion control in large-scale datacenter networks
    Xu, Lei
    Xu, Ke
    Jiang, Yong
    Ren, Fengyuan
    Wang, Haiyang
    COMPUTER NETWORKS, 2017, 124 : 46 - 60
  • [27] Enhancing TCP Incast Congestion Control Over Large-scale Datacenter Networks
    Xu, Lei
    Xu, Ke
    Jiang, Yong
    Ren, Fengyuan
    Wang, Haiyang
    2015 IEEE 23RD INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2015, : 225 - 230
  • [28] Accelerating large-scale phase-field simulations with GPU
    Shi, Xiaoming
    Huang, Houbing
    Cao, Guoping
    Ma, Xingqiao
    AIP ADVANCES, 2017, 7 (10):
  • [29] Collective behavior of large-scale neural networks with GPU acceleration
    Jingyi Qu
    Rubin Wang
    Cognitive Neurodynamics, 2017, 11 : 553 - 563
  • [30] Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU
    Han, Wei
    Mawhirter, Daniel
    Wu, Bo
    Buland, Matthew
    2017 26TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT), 2017, : 233 - 245