DxPU: Large-scale Disaggregated GPU Pools in the Datacenter

Cited: 0
Authors
He, Bowen [1 ,2 ]
Zheng, Xiao [2 ]
Chen, Yuan [1 ,2 ]
Li, Weinan [2 ]
Zhou, Yajin [1 ]
Long, Xin [2 ]
Zhang, Pengcheng [2 ]
Lu, Xiaowei [2 ]
Jiang, Linquan [2 ]
Liu, Qiang [2 ]
Cai, Dennis [2 ]
Zhang, Xiantao [2 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
Clouds; clusters; data centers;
DOI
10.1145/3617995
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
The rapid adoption of AI and the convenience offered by cloud services have resulted in growing demand for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed pairing of host servers and GPUs is extremely inefficient in terms of resource utilization, upgrade, and maintenance. To address these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers: it aggregates GPUs into a pool and allocates GPU nodes according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this article, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU nodes as users demand. To understand the performance overhead incurred by DxPU, we build a performance model for AI-specific workloads. Guided by the modeling results, we develop a prototype system, which has been deployed in the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10% compared with native GPU servers in most user scenarios.
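The abstract's core idea is the pool-and-allocate pattern: GPUs are detached from individual host servers, pooled at datacenter scale, and attached to tenants on demand. The sketch below is only a toy illustration of that pattern under assumed names; GpuPool, GpuNode, allocate, and release are hypothetical and do not reflect DxPU's actual implementation.

    # Toy sketch (not from the paper): a pool of disaggregated GPU nodes
    # that grants as many nodes as a tenant requests, if available.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class GpuNode:
        node_id: str
        allocated_to: Optional[str] = None   # tenant currently holding this node

    @dataclass
    class GpuPool:
        nodes: list = field(default_factory=list)

        def free_nodes(self):
            # Nodes not currently attached to any tenant.
            return [n for n in self.nodes if n.allocated_to is None]

        def allocate(self, tenant: str, count: int):
            # Attach `count` free GPU nodes to `tenant`, or fail without side effects.
            free = self.free_nodes()
            if len(free) < count:
                raise RuntimeError(f"only {len(free)} free nodes, {count} requested")
            granted = free[:count]
            for n in granted:
                n.allocated_to = tenant
            return granted

        def release(self, tenant: str):
            # Return all of a tenant's GPU nodes to the pool.
            for n in self.nodes:
                if n.allocated_to == tenant:
                    n.allocated_to = None

    # Usage: a host server requests 4 GPUs from a 16-node pool.
    pool = GpuPool([GpuNode(f"gpu-{i:02d}") for i in range(16)])
    vm_gpus = pool.allocate(tenant="vm-1234", count=4)
    print([n.node_id for n in vm_gpus])   # ['gpu-00', 'gpu-01', 'gpu-02', 'gpu-03']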
Pages: 23
相关论文
共 50 条
  • [21] SELF-PACED ECONOMICS INSTRUCTION - LARGE-SCALE DISAGGREGATED EVALUATION
    SOPER, JC
    THORNTON, RM
    JOURNAL OF ECONOMIC EDUCATION, 1976, 7 (02): : 81 - 91
  • [22] A Traffic Visualization Framework for Monitoring Large-scale Inter- DataCenter Network
    Elbaham, Meryem
    Nguyen, Kim Khoa
    Cheriet, Mohammed
    2016 12TH INTERNATIONAL CONFERENCE ON NETWORK AND SERVICE MANAGEMENT AND WORKSHOPS(CNSM 2016), 2016, : 277 - 281
  • [23] Large-Scale Optical Circuit Switch Architecture for Intra-Datacenter Networking
    Mori, Yojiro
    Sato, Ken-ichi
    2018 OPTICAL FIBER COMMUNICATIONS CONFERENCE AND EXPOSITION (OFC), 2018,
  • [24] A Virtual Network to Achieve Low Energy Consumption in Optical Large-scale Datacenter
    Tarutani, Yuya
    Ohsita, Yuichi
    Murata, Masayuki
    2012 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS (IEEE ICCS 2012), 2012, : 45 - 49
  • [25] Recent Progress in Large-scale Optical Switches for Intra-datacenter Interconnection
    Ueda, Koh
    Mori, Yojiro
    Hasegawa, Hiroshi
    Sato, Ken-Ichi
    2016 PROGRESS IN ELECTROMAGNETICS RESEARCH SYMPOSIUM (PIERS), 2016, : 2369 - 2373
  • [26] Throughput optimization of TCP incast congestion control in large-scale datacenter networks
    Xu, Lei
    Xu, Ke
    Jiang, Yong
    Ren, Fengyuan
    Wang, Haiyang
    COMPUTER NETWORKS, 2017, 124 : 46 - 60
  • [27] Enhancing TCP Incast Congestion Control Over Large-scale Datacenter Networks
    Xu, Lei
    Xu, Ke
    Jiang, Yong
    Ren, Fengyuan
    Wang, Haiyang
    2015 IEEE 23RD INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2015, : 225 - 230
  • [28] Accelerating large-scale phase-field simulations with GPU
    Shi, Xiaoming
    Huang, Houbing
    Cao, Guoping
    Ma, Xingqiao
    AIP ADVANCES, 2017, 7 (10):
  • [29] Collective behavior of large-scale neural networks with GPU acceleration
    Jingyi Qu
    Rubin Wang
    Cognitive Neurodynamics, 2017, 11 : 553 - 563
  • [30] Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU
    Han, Wei
    Mawhirter, Daniel
    Wu, Bo
    Buland, Matthew
    2017 26TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT), 2017, : 233 - 245