Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads

Cited by: 2
Authors
Zhou, Qinghua [1]
Anthony, Quentin [1]
Shafi, Aamir [1]
Subramoni, Hari [1]
Panda, Dhabaleswar K. [1]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Keywords
Broadcast; Compression; GPU-Aware MPI; Deep Learning
DOI
10.1109/HiPC56025.2022.00016
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline Code
0812
Abstract
With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes for distributed training. Communication of large GPU-resident messages between GPUs is becoming a bottleneck in overall training performance. GPU-Aware MPI libraries are widely adopted by state-of-the-art DL frameworks to improve communication performance. In existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often used to synchronize the updated model parameters among all GPUs. However, with state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data burdens training performance due to the limited bandwidth of the interconnect between GPU nodes. On the other hand, recent research on using GPU-based compression libraries to relieve pressure on the nearly saturated interconnect, and on co-designing online compression with the communication pattern, offers a new perspective for optimizing broadcast performance on modern GPU clusters. In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large-message broadcast communication. The proposed design is evaluated at both the microbenchmark and application levels. At the microbenchmark level, it reduces broadcast communication latency by up to 80.9% compared to a baseline using a state-of-the-art MPI library, and by 55.1% compared to existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while maintaining similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.
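To make the communication pattern described in the abstract concrete, below is a minimal sketch, not the paper's actual implementation, of a chunked-chain broadcast with per-chunk compression, written against plain MPI point-to-point calls. The gpu_compress()/gpu_decompress() hooks are hypothetical placeholders (simple memcpy here) standing in for a real GPU compression library; the paper's design instead performs compression inside a GPU-Aware MPI_Bcast on GPU-resident buffers and overlaps it with the chunked-chain transfers, which this host-side sketch does not attempt.

/*
 * Illustrative sketch only: chunked-chain broadcast with a per-chunk
 * compression hook. Rank 0 compresses each chunk and starts the chain;
 * rank r receives the compressed chunk from rank r-1, decompresses it
 * into its own buffer, and forwards the compressed bytes to rank r+1.
 * gpu_compress()/gpu_decompress() are hypothetical placeholders.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder codec: a real design would call a GPU compression library
 * here. memcpy keeps the sketch self-contained and runnable. */
static size_t gpu_compress(const void *src, size_t n, void *dst) {
    memcpy(dst, src, n);
    return n;                          /* a real codec would return fewer bytes */
}
static void gpu_decompress(const void *src, size_t n, void *dst, size_t out_n) {
    (void)out_n;
    memcpy(dst, src, n);
}

static void chunked_chain_bcast(void *buf, size_t len, size_t chunk,
                                int rank, int size, MPI_Comm comm) {
    char *scratch = malloc(chunk);     /* holds one compressed chunk */
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? (len - off) : chunk;
        size_t cbytes;
        if (rank == 0) {               /* root: compress and start the chain */
            cbytes = gpu_compress((char *)buf + off, n, scratch);
        } else {                       /* receive compressed chunk from predecessor */
            MPI_Status st;
            int got;
            MPI_Recv(scratch, (int)chunk, MPI_BYTE, rank - 1, 0, comm, &st);
            MPI_Get_count(&st, MPI_BYTE, &got);
            cbytes = (size_t)got;
            gpu_decompress(scratch, cbytes, (char *)buf + off, n);
        }
        if (rank + 1 < size)           /* forward only the compressed bytes */
            MPI_Send(scratch, (int)cbytes, MPI_BYTE, rank + 1, 0, comm);
    }
    free(scratch);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    size_t len = (size_t)1 << 24;      /* 16 MiB standing in for model parameters */
    char *params = malloc(len);
    if (rank == 0) memset(params, 1, len);   /* root holds the updated weights */

    chunked_chain_bcast(params, len, (size_t)1 << 20, rank, size, MPI_COMM_WORLD);

    free(params);
    MPI_Finalize();
    return 0;
}

The point the sketch illustrates is the one the abstract makes: every inter-node hop of the chain forwards only compressed chunks, so the bytes carried by the nearly saturated links shrink, and chunking lets different chunks be in flight on different hops of the chain at the same time.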
Pages: 22-31
Page count: 10
Related Papers
50 records in total
  • [1] Awan, Ammar Ahmad; Subramoni, Hari; Chu, Ching-Hsiang; Panda, Dhabaleswar K. Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? EuroMPI 2018: Proceedings of the 25th European MPI Users' Group Meeting, 2018.
  • [2] Santoso, Danny; Jeon, Hyeran. Understanding of GPU Architectural Vulnerability for Deep Learning Workloads. 2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2019.
  • [3] Zheng, Lianmin; Chen, Tianqi. Optimizing Deep Learning Workloads on ARM GPU with TVM. 1st ACM ReQuEST Workshop/Tournament on Reproducible Software/Hardware Co-design of Pareto-Efficient Deep Learning, 2018.
  • [4] Tallent, Nathan R.; Gawande, Nitin A.; Siegel, Charles; Vishnu, Abhinav; Hoisie, Adolfy. Evaluating On-Node GPU Interconnects for Deep Learning Workloads. High Performance Computing Systems: Performance Modeling, Benchmarking, and Simulation (PMBS 2017), 2018, 10724: 3-21.
  • [5] Qian, Junjie; Kim, Taeyoon; Jeon, Myeongjae. Reliability of Large Scale GPU Clusters for Deep Learning Workloads. Web Conference 2021: Companion of the World Wide Web Conference (WWW 2021), 2021: 179-181.
  • [6] Chen, Zhaoyun; Luo, Lei; Quan, Wei; Wen, Mei; Zhang, Chunyuan. Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters. IEEE Conference on Computer Communications Workshops (IEEE INFOCOM 2019 WKSHPS), 2019: 1023-1024.
  • [7] Hu, Qinghao; Sun, Peng; Yan, Shengen; Wen, Yonggang; Zhang, Tianwei. Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
  • [8] Liu, Heting; Li, Zhichao; Tan, Cheng; Yang, Rongqiu; Cao, Guohong; Liu, Zherui; Guo, Chuanxiong. Predicting GPU Failures With High Precision Under Deep Learning Workloads. Proceedings of the 16th ACM International Systems and Storage Conference (SYSTOR 2023), 2023: 124-135.
  • [9] Liu, Rui; Wong, David; Lange, Dave; Larsson, Patrik; Jethava, Vinay; Zheng, Qing. Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads. Proceedings of the 6th Workshop on Data Management for End-to-End Machine Learning (DEEM 2022), 2022.
  • [10] Wang, Jianzong; Cheng, Lianglun. DistDL: A Distributed Deep Learning Service Schema with GPU Accelerating. Web Technologies and Applications (APWEB 2015), 2015, 9313: 793-804.