Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment

Cited by: 1
Authors
Choi, HyeonSeong [1 ]
Kim, Youngrang [2 ]
Lee, Jaehwan [3 ]
Kim, Yoonhee [4 ]
Affiliations
[1] Korea Aerosp Univ, KAU, Elect & Informat Engn, Goyang City, Gyeonggi Do, South Korea
[2] Korea Aerosp Univ, Goyang City, Gyeonggi Do, South Korea
[3] Korea Aerosp Univ, Dept Elect & Informat Engn, Goyang City, Gyeonggi Do, South Korea
[4] Sookmyung Womens Univ, Comp Sci Dept, Seoul, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Docker; Collective Communication; Distributed Deep Learning; Multi-GPU; MPI;
DOI
10.3837/tiis.2021.03.006
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Recently, most cloud services use a Docker container environment to provide their services. However, no prior studies have evaluated the performance of communication libraries for multi-GPU based distributed deep learning in a Docker container environment. In this paper, we propose an efficient communication architecture for multi-GPU based deep learning in a Docker container environment by evaluating the performance of various communication libraries. We compare the performance of the parameter server architecture and the All-reduce architecture, which are typical distributed deep learning architectures. Further, we analyze the performance of two separate multi-GPU resource allocation policies: allocating a single GPU to each Docker container and allocating multiple GPUs to each Docker container. We also examine the scalability of collective communication by increasing the number of GPUs from one to four. Through experiments, we compare OpenMPI and MPICH, which are representative open-source MPI libraries, and NCCL, which is NVIDIA's collective communication library for the multi-GPU setting. In the parameter server architecture, we show that using CUDA-aware OpenMPI with multiple GPUs per Docker container reduces communication latency by up to 75%. We also show that using NCCL in the All-reduce architecture reduces communication latency by up to 93% compared to the other libraries.
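As context for the abstract, the following is a minimal, illustrative sketch of the all-reduce gradient-synchronization pattern the paper benchmarks, written against the standard C MPI API. The buffer size, the placeholder gradient values, and the launch commands in the note below are assumptions for illustration only, not the authors' benchmark code.

```c
/* Illustrative all-reduce sketch (not the authors' benchmark code).
 * Each rank contributes a local gradient buffer; MPI_Allreduce sums the
 * buffers across ranks so every rank receives the aggregated gradients.
 * With a CUDA-aware MPI build (e.g. CUDA-aware OpenMPI), the buffers
 * could be GPU device pointers instead of host memory. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define GRAD_COUNT (1 << 20)  /* assumed gradient size: 1M floats */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *local_grad  = malloc(GRAD_COUNT * sizeof(float));
    float *global_grad = malloc(GRAD_COUNT * sizeof(float));
    for (size_t i = 0; i < GRAD_COUNT; i++)
        local_grad[i] = (float)rank;  /* stand-in for computed gradients */

    /* Sum gradients across all ranks (one rank per GPU or per container). */
    MPI_Allreduce(local_grad, global_grad, GRAD_COUNT,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allreduce done on %d ranks, grad[0]=%f\n",
               size, global_grad[0]);

    free(local_grad);
    free(global_grad);
    MPI_Finalize();
    return 0;
}
```

Such a program would typically be launched with something like `mpirun -np 4 ./allreduce_bench`. The two allocation policies compared in the paper correspond roughly to running one rank in each single-GPU container (e.g. `docker run --gpus device=0 ...`) versus running several ranks inside one container that sees all GPUs (e.g. `docker run --gpus all ...`).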
Pages: 911 - 931
Number of pages: 21