Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access

Cited by: 9
Authors
Jeong, Jinwoo [1]
Baek, Seungsu [1]
Ahn, Jeongseob [1]
Affiliations
[1] Ajou Univ, Suwon, South Korea
Keywords
DNN model serving; Direct-host-access; Parallel-transmission
DOI
10.1145/3552326.3567508
CLC classification
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
As deep learning (DL) inference has been widely adopted for building user-facing applications in many domains, it is increasingly important for DL inference servers to achieve high throughput while preserving bounded latency. A DL inference request can be served immediately if the corresponding model is already in GPU memory; otherwise, the model must first be loaded from the host to the GPU, adding a significant delay to inference. This paper proposes DeepPlan to minimize inference latency while provisioning DL models from host to GPU in server environments. First, we take advantage of the direct-host-access facility provided by commodity GPUs, which allows the GPU to access particular layers of a model in host memory directly, without loading them. Second, we parallelize model transmission across multiple GPUs to reduce the time for loading models from host to GPU. We show that a single inference can achieve a 1.94x speedup compared with the state-of-the-art pipelining approach for BERT-Base. When deploying multiple BERT, RoBERTa, and GPT-2 instances on a DL inference serving system, DeepPlan shows a significant performance improvement over the pipelining technique while maintaining stable 99th-percentile tail latency.
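The two mechanisms named in the abstract can be illustrated with standard CUDA runtime APIs. The sketch below is hypothetical and is not taken from the DeepPlan source: it maps a pinned host buffer into the GPU address space so a kernel can read "layer weights" directly from host memory (direct-host-access), and it copies the remaining weights to two GPUs concurrently on separate streams (parallel transmission). The toy kernel, buffer sizes, and the two-way split are illustrative assumptions only.

// Hypothetical sketch (not the DeepPlan implementation) of the two ideas above.
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call) do { cudaError_t e = (call); \
  if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

// Toy "layer": sums its weights so the GPU actually touches the host buffer.
__global__ void use_layer(const float* weights, size_t n, float* out) {
    float acc = 0.f;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) acc += weights[i];
    atomicAdd(out, acc);
}

int main() {
    const size_t n = 1 << 20;   // illustrative weight count

    // (1) Direct-host-access: pinned, mapped host buffer visible to the GPU.
    float* h_weights;
    CHECK(cudaHostAlloc(&h_weights, n * sizeof(float), cudaHostAllocMapped));
    for (size_t i = 0; i < n; ++i) h_weights[i] = 1.f;

    float* d_view;              // device-side alias of the host buffer (no copy issued)
    CHECK(cudaHostGetDevicePointer(&d_view, h_weights, 0));

    float* d_out;
    CHECK(cudaMalloc(&d_out, sizeof(float)));
    CHECK(cudaMemset(d_out, 0, sizeof(float)));
    use_layer<<<1, 256>>>(d_view, n, d_out);   // kernel reads host memory over PCIe
    CHECK(cudaDeviceSynchronize());

    // (2) Parallel transmission: copy halves of another buffer to GPU 0 and
    //     GPU 1 at the same time (requires at least two GPUs).
    int ngpus = 0;
    CHECK(cudaGetDeviceCount(&ngpus));
    if (ngpus >= 2) {
        float* h_rest;
        CHECK(cudaHostAlloc(&h_rest, n * sizeof(float), cudaHostAllocPortable));
        float* d_half[2]; cudaStream_t s[2];
        for (int g = 0; g < 2; ++g) {
            CHECK(cudaSetDevice(g));
            CHECK(cudaStreamCreate(&s[g]));
            CHECK(cudaMalloc(&d_half[g], n / 2 * sizeof(float)));
            CHECK(cudaMemcpyAsync(d_half[g], h_rest + g * (n / 2),
                                  n / 2 * sizeof(float),
                                  cudaMemcpyHostToDevice, s[g]));   // both copies in flight
        }
        for (int g = 0; g < 2; ++g) {
            CHECK(cudaSetDevice(g));
            CHECK(cudaStreamSynchronize(s[g]));
        }
        CHECK(cudaFreeHost(h_rest));
    }
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_weights));
    return 0;
}

In this sketch the mapped-memory path trades copy latency for slower PCIe reads during the kernel, while the multi-GPU copy path aggregates host-to-device bandwidth; the paper's contribution is deciding how to combine these per layer, which the sketch does not attempt.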
Pages: 249-265
Page count: 17
Related papers
50 in total
  • [1] Fast evolutionary image processing using multi-GPUs
    Ando, Jun
    Nagao, Tomoharu
    2007 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-8, 2007: 1518+
  • [2] Efficient Parallel Algorithm for Compound Comparisons on Multi-GPUs
    Lin, Chun-Yuan
    Wang, Chung-Hung
    Hung, Che-Lun
    Lin, Yu-Shiang
    2014 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2014
  • [3] Efficient Parallel Knuth-Morris-Pratt Algorithm for Multi-GPUs with CUDA
    Lin, K.-J.
    Springer Science and Business Media Deutschland GmbH, 2013, (21)
  • [4] Efficient isogeometric topology optimization via multi-GPUs and CPUs heterogeneous architecture
    Han, Jinpeng
    Zhang, Haobo
    Gao, Baichuan
    Yu, Jingui
    Jin, Peng
    Yang, Jianzhong
    Xia, Zhaohui
    OPTIMIZATION AND ENGINEERING, 2024
  • [5] CUDA ClustalW: An efficient parallel algorithm for progressive multiple sequence alignment on Multi-GPUs
    Hung, Che-Lun
    Lin, Yu-Shiang
    Lin, Chun-Yuan
    Chung, Yeh-Ching
    Chung, Yi-Fang
    COMPUTATIONAL BIOLOGY AND CHEMISTRY, 2015, 58 : 62 - 68
  • [6] A Framework for Direct and Transparent Data Exchange of Filter-stream Applications in Multi-GPUs Architectures
    Ramos, Gabriel
    Andrade, Guilherme
    Sachetto, Rafael
    Madeira, Daniel
    Carvalho, Renan
    Ferreira, Renato
    Mourao, Fernando
    Rocha, Leonardo
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS 2017), 2017, 108 : 1642 - 1651
  • [7] Particle-resolved thermal lattice Boltzmann simulation using OpenACC on multi-GPUs
    Xu, Ao
    Li, Bo-Tao
    INTERNATIONAL JOURNAL OF HEAT AND MASS TRANSFER, 2024, 218
  • [8] Multi-GPUs parallel computation of dendrite growth in forced convection using the phase-field-lattice Boltzmann model
    Sakane, Shinji
    Takaki, Tomohiro
    Rojas, Roberto
    Ohno, Munekazu
    Shibuta, Yasushi
    Shimokawabe, Takashi
    Aoki, Takayuki
    JOURNAL OF CRYSTAL GROWTH, 2017, 474 : 154 - 159
  • [9] Accelerating Time-Domain SAR Raw Data Simulation for Large Areas Using Multi-GPUs
    Zhang, Fan
    Hu, Chen
    Li, Wei
    Hu, Wei
    Li, Heng-Chao
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2014, 7 (09) : 3956 - 3966
  • [10] Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs
    Lin, Chun-Yuan
    Wang, Chung-Hung
    Hung, Che-Lun
    Lin, Yu-Shiang
    INTERNATIONAL JOURNAL OF GENOMICS, 2015, 2015