Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access

Cited by: 9
Authors
Jeong, Jinwoo [1]
Baek, Seungsu [1]
Ahn, Jeongseob [1]
Affiliations
[1] Ajou Univ, Suwon, South Korea
Keywords
DNN model serving; Direct-host-access; Parallel-transmission
DOI
10.1145/3552326.3567508
CLC classification
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
As deep learning (DL) inference has been widely adopted for building user-facing applications in many domains, it is increasingly important for DL inference servers to achieve high throughput while preserving bounded latency. A DL inference request can be served immediately if the corresponding model is already in GPU memory; otherwise, the model must first be loaded from the host to the GPU, adding a significant delay to inference. This paper proposes DeepPlan to minimize inference latency while provisioning DL models from host to GPU in server environments. First, we take advantage of the direct-host-access facility provided by commodity GPUs, which allows the GPU to access particular layers of a model in host memory directly, without loading them. Second, we parallelize model transmission across multiple GPUs to reduce the time for loading models from host to GPU. We show that a single inference can achieve a 1.94x speedup compared with the state-of-the-art pipelining approach for BERT-Base. When deploying multiple BERT, RoBERTa, and GPT-2 instances on a DL inference serving system, DeepPlan shows a significant performance improvement over the pipelining technique while maintaining stable 99th-percentile tail latency.
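The two mechanisms named in the abstract can be illustrated with standard CUDA runtime APIs. The sketch below is hypothetical and is not taken from the DeepPlan source: it maps a pinned host buffer into the GPU address space so a kernel can read "layer weights" directly from host memory (direct-host-access), and it copies the remaining weights to two GPUs concurrently on separate streams (parallel transmission). The toy kernel, buffer sizes, and the two-way split are illustrative assumptions only.

// Hypothetical sketch (not the DeepPlan implementation) of the two ideas above.
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call) do { cudaError_t e = (call); \
  if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

// Toy "layer": sums its weights so the GPU actually touches the host buffer.
__global__ void use_layer(const float* weights, size_t n, float* out) {
    float acc = 0.f;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) acc += weights[i];
    atomicAdd(out, acc);
}

int main() {
    const size_t n = 1 << 20;   // illustrative weight count

    // (1) Direct-host-access: pinned, mapped host buffer visible to the GPU.
    float* h_weights;
    CHECK(cudaHostAlloc(&h_weights, n * sizeof(float), cudaHostAllocMapped));
    for (size_t i = 0; i < n; ++i) h_weights[i] = 1.f;

    float* d_view;              // device-side alias of the host buffer (no copy issued)
    CHECK(cudaHostGetDevicePointer(&d_view, h_weights, 0));

    float* d_out;
    CHECK(cudaMalloc(&d_out, sizeof(float)));
    CHECK(cudaMemset(d_out, 0, sizeof(float)));
    use_layer<<<1, 256>>>(d_view, n, d_out);   // kernel reads host memory over PCIe
    CHECK(cudaDeviceSynchronize());

    // (2) Parallel transmission: copy halves of another buffer to GPU 0 and
    //     GPU 1 at the same time (requires at least two GPUs).
    int ngpus = 0;
    CHECK(cudaGetDeviceCount(&ngpus));
    if (ngpus >= 2) {
        float* h_rest;
        CHECK(cudaHostAlloc(&h_rest, n * sizeof(float), cudaHostAllocPortable));
        float* d_half[2]; cudaStream_t s[2];
        for (int g = 0; g < 2; ++g) {
            CHECK(cudaSetDevice(g));
            CHECK(cudaStreamCreate(&s[g]));
            CHECK(cudaMalloc(&d_half[g], n / 2 * sizeof(float)));
            CHECK(cudaMemcpyAsync(d_half[g], h_rest + g * (n / 2),
                                  n / 2 * sizeof(float),
                                  cudaMemcpyHostToDevice, s[g]));   // both copies in flight
        }
        for (int g = 0; g < 2; ++g) {
            CHECK(cudaSetDevice(g));
            CHECK(cudaStreamSynchronize(s[g]));
        }
        CHECK(cudaFreeHost(h_rest));
    }
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_weights));
    return 0;
}

In this sketch the mapped-memory path trades copy latency for slower PCIe reads during the kernel, while the multi-GPU copy path aggregates host-to-device bandwidth; the paper's contribution is deciding how to combine these per layer, which the sketch does not attempt.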
Pages: 249-265
Page count: 17
Related papers
50 in total
  • [1] Fast evolutionary image processing using multi-GPUs
    Ando, Jun
    Nagao, Tomoharu
    2007 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-8, 2007: 1518+
  • [2] Efficient Parallel Algorithm for Compound Comparisons on Multi-GPUs
    Lin, Chun-Yuan
    Wang, Chung-Hung
    Hung, Che-Lun
    Lin, Yu-Shiang
    2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2014
  • [3] Efficient Parallel Knuth-Morris-Pratt Algorithm for Multi-GPUs with CUDA
    Lin, K.-J.
    Springer Science and Business Media Deutschland GmbH, 2013, (21)
  • [4] Efficient isogeometric topology optimization via multi-GPUs and CPUs heterogeneous architecture
    Han, Jinpeng
    Zhang, Haobo
    Gao, Baichuan
    Yu, Jingui
    Jin, Peng
    Yang, Jianzhong
    Xia, Zhaohui
    OPTIMIZATION AND ENGINEERING, 2024
  • [5] CUDA ClustalW: An efficient parallel algorithm for progressive multiple sequence alignment on Multi-GPUs
    Hung, Che-Lun
    Lin, Yu-Shiang
    Lin, Chun-Yuan
    Chung, Yeh-Ching
    Chung, Yi-Fang
    COMPUTATIONAL BIOLOGY AND CHEMISTRY, 2015, 58 : 62 - 68
  • [6] A Framework for Direct and Transparent Data Exchange of Filter-stream Applications in Multi-GPUs Architectures
    Ramos, Gabriel
    Andrade, Guilherme
    Sachetto, Rafael
    Madeira, Daniel
    Carvalho, Renan
    Ferreira, Renato
    Mourao, Fernando
    Rocha, Leonardo
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS 2017), 2017, 108 : 1642 - 1651
  • [7] Particle-resolved thermal lattice Boltzmann simulation using OpenACC on multi-GPUs
    Xu, Ao
    Li, Bo-Tao
    INTERNATIONAL JOURNAL OF HEAT AND MASS TRANSFER, 2024, 218
  • [8] Multi-GPUs parallel computation of dendrite growth in forced convection using the phase-field-lattice Boltzmann model
    Sakane, Shinji
    Takaki, Tomohiro
    Rojas, Roberto
    Ohno, Munekazu
    Shibuta, Yasushi
    Shimokawabe, Takashi
    Aoki, Takayuki
    JOURNAL OF CRYSTAL GROWTH, 2017, 474 : 154 - 159
  • [9] Accelerating Time-Domain SAR Raw Data Simulation for Large Areas Using Multi-GPUs
    Zhang, Fan
    Hu, Chen
    Li, Wei
    Hu, Wei
    Li, Heng-Chao
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2014, 7 (09) : 3956 - 3966
  • [10] Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs
    Lin, Chun-Yuan
    Wang, Chung-Hung
    Hung, Che-Lun
    Lin, Yu-Shiang
    INTERNATIONAL JOURNAL OF GENOMICS, 2015, 2015