Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads

Cited by: 2
Authors
Zhou, Qinghua [1 ]
Anthony, Quentin [1 ]
Shafi, Aamir [1 ]
Subramoni, Hari [1 ]
Panda, Dhabaleswar K. [1 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Keywords
Broadcast; Compression; GPU-Aware MPI; Deep Learning;
DOI
10.1109/HiPC56025.2022.00016
Chinese Library Classification
TP3 [Computing technology; computer technology]
Discipline code
0812
Abstract
With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes for distributed training. Large-message communication of GPU data between GPUs is becoming a bottleneck for overall training performance. GPU-Aware MPI libraries are widely adopted by state-of-the-art DL frameworks to improve communication performance. In existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often used to synchronize the updated model parameters among all GPUs. However, with state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data tends to burden training performance due to the limited bandwidth of the interconnect between GPU nodes. On the other hand, recent research on using GPU-based compression libraries to relieve pressure on nearly saturated interconnects, and on co-designing online compression with the communication pattern, provides a new perspective for optimizing broadcast performance on modern GPU clusters. In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large-message broadcast communication. The proposed design is evaluated at both the microbenchmark and application levels. At the microbenchmark level, it reduces broadcast communication latency by up to 80.9% compared to the baseline using a state-of-the-art MPI library and by 55.1% compared to existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while maintaining similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.
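The chunked-chain scheme with collective-level online compression described in the abstract can be illustrated with a minimal MPI sketch. This is not the paper's implementation: gpu_compress/gpu_decompress below are identity stubs standing in for a real GPU compression library, buffers are kept in host memory for brevity, chunks are moved with blocking sends, and the overlap of compression, forwarding, and decompression that an optimized design would exploit is omitted. The point it shows is that each chunk is compressed once at the root and forwarded in compressed form along the chain of ranks, so every link carries less data than an uncompressed broadcast.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical compression hooks: identity stubs stand in for a real GPU
 * compressor working on device buffers. Return value is the compressed size. */
static size_t gpu_compress(const void *src, size_t bytes, void *dst)
{
    memcpy(dst, src, bytes);
    return bytes;
}

static void gpu_decompress(const void *src, size_t comp_bytes,
                           void *dst, size_t bytes)
{
    (void)bytes;
    memcpy(dst, src, comp_bytes);
}

/* Chunked-chain broadcast sketch: the root compresses each chunk once, every
 * rank forwards the still-compressed chunk to its successor in the chain, and
 * each non-root rank decompresses it locally. */
static void chain_bcast_compressed(void *buf, size_t bytes, size_t chunk_bytes,
                                   int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int pos  = (rank - root + size) % size;            /* position in the chain */
    int prev = (rank - 1 + size) % size;               /* chain predecessor     */
    int next = (pos == size - 1) ? MPI_PROC_NULL
                                 : (rank + 1) % size;  /* chain successor       */

    /* Sized for the identity stub; a real compressor needs its worst-case bound. */
    char *scratch = malloc(chunk_bytes);

    for (size_t off = 0; off < bytes; off += chunk_bytes) {
        size_t len = (bytes - off < chunk_bytes) ? bytes - off : chunk_bytes;
        unsigned long clen;

        if (pos == 0) {
            /* Root: compress once; downstream ranks never re-compress. */
            clen = (unsigned long)gpu_compress((char *)buf + off, len, scratch);
        } else {
            MPI_Recv(&clen, 1, MPI_UNSIGNED_LONG, prev, 0, comm,
                     MPI_STATUS_IGNORE);
            MPI_Recv(scratch, (int)clen, MPI_BYTE, prev, 1, comm,
                     MPI_STATUS_IGNORE);
        }

        /* Forward the compressed chunk along the chain (no-op at the tail,
         * since sends to MPI_PROC_NULL return immediately). */
        MPI_Send(&clen, 1, MPI_UNSIGNED_LONG, next, 0, comm);
        MPI_Send(scratch, (int)clen, MPI_BYTE, next, 1, comm);

        if (pos != 0)
            gpu_decompress(scratch, clen, (char *)buf + off, len);
    }
    free(scratch);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t n = 1 << 20;                      /* 1 MiB payload              */
    char *data = malloc(n);
    memset(data, rank == 0 ? 7 : 0, n);      /* only the root has the data */

    chain_bcast_compressed(data, n, 64 * 1024, /*root=*/0, MPI_COMM_WORLD);

    printf("rank %d: data[0] = %d\n", rank, data[0]);
    free(data);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun across several ranks, every rank ends up with the root's payload; swapping the identity stubs for a real GPU compression library and overlapping the per-chunk stages is where the design space explored in the paper lies.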
Pages: 22-31
Number of pages: 10