Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment

Cited by: 1
Authors
Choi, HyeonSeong [1 ]
Kim, Youngrang [2 ]
Lee, Jaehwan [3 ]
Kim, Yoonhee [4 ]
Affiliations
[1] Korea Aerosp Univ, KAU, Elect & Informat Engn, Goyang City, Gyeonggi Do, South Korea
[2] Korea Aerosp Univ, Goyang City, Gyeonggi Do, South Korea
[3] Korea Aerosp Univ, Dept Elect & Informat Engn, Goyang City, Gyeonggi Do, South Korea
[4] Sookmyung Womens Univ, Comp Sci Dept, Seoul, South Korea
Funding
National Research Foundation of Singapore
Keywords
Docker; Collective Communication; Distributed Deep Learning; Multi-GPU; MPI
DOI
10.3837/tiis.2021.03.006
CLC Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Recently, most cloud services use the Docker container environment to provide their services. However, there has been no research evaluating the performance of communication libraries for multi-GPU based distributed deep learning in a Docker container environment. In this paper, we propose an efficient communication architecture for multi-GPU based deep learning in a Docker container environment by evaluating the performance of various communication libraries. We compare the performance of the parameter server architecture and the All-reduce architecture, which are typical distributed deep learning architectures. Further, we analyze the performance of two separate multi-GPU resource allocation policies: allocating a single GPU to each Docker container and allocating multiple GPUs to each Docker container. We also examine the scalability of collective communication by increasing the number of GPUs from one to four. Through experiments, we compare OpenMPI and MPICH, which are representative open-source MPI libraries, and NCCL, which is NVIDIA's collective communication library for the multi-GPU setting. In the parameter server architecture, we show that using CUDA-aware OpenMPI with multiple GPUs per Docker container reduces communication latency by up to 75%. We also show that using NCCL in the All-reduce architecture reduces communication latency by up to 93% compared to the other libraries.
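To make the compared communication pattern concrete, the following is a minimal sketch of the All-reduce step over GPU-resident gradient buffers using a CUDA-aware MPI build (for example, OpenMPI compiled with CUDA support). The one-rank-per-GPU layout, buffer size, and variable names are illustrative assumptions for this sketch, not the benchmark configuration reported in the paper.

/* Minimal sketch: gradient All-reduce with a CUDA-aware MPI.
 * Assumes one MPI rank per GPU, e.g. one Docker container per GPU
 * or one container holding several GPUs with one rank pinned to each. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pin this rank to one of the GPUs visible inside the container. */
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    cudaSetDevice(num_gpus > 0 ? rank % num_gpus : 0);

    const size_t n = 1 << 20;                 /* assumed gradient size: 1M floats */
    float *d_grad = NULL;
    cudaMalloc((void **)&d_grad, n * sizeof(float));
    cudaMemset(d_grad, 0, n * sizeof(float));

    /* With a CUDA-aware MPI, the device pointer is passed directly and the
     * library moves data GPU-to-GPU without an explicit host staging copy. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}

For the NCCL comparison, the corresponding collective is ncclAllReduce, which performs the same sum reduction over device buffers asynchronously on a CUDA stream; this is the library the paper finds fastest in the All-reduce architecture.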
Pages: 911-931
Page count: 21