An efficient bandwidth-adaptive gradient compression algorithm for distributed training of deep neural networks

Cited: 1
Authors
Wang, Zeqin [1 ]
Duan, Qingyang [1 ]
Xu, Yuedong [1 ]
Zhang, Liang [2 ]
Affiliations
[1] Fudan Univ, Sch Informat Sci & Technol, Shanghai 200433, Peoples R China
[2] Huawei Technol, Nanjing Res Ctr, Nanjing 210096, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Distributed deep learning; Gradient compression; Adaptive sparsification; Dynamic bandwidth; COMMUNICATION; OPTIMIZATION;
DOI
10.1016/j.sysarc.2024.103116
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
In distributed deep learning with data parallelism, the communication bottleneck throttles the efficiency of model training. Recent studies adopt versatile gradient compression techniques, with gradient sparsification standing out as an effective approach for reducing the number of gradients to be transmitted. However, the deployment of gradient sparsification is adversely affected by changes in the network environment of real systems, and existing methods either neglect bandwidth dynamics during training or suffer drastic fluctuations of the compression ratio. In this paper, we propose ACE, a novel adaptive gradient compression mechanism with high communication efficiency under bandwidth variation. ACE adapts the sparsification ratio to the average bandwidth in a time window, rather than following its dynamics exactly. To accurately compute the compression ratio, we first profile the compression time and model a single iteration time consisting of communication, computation and compression operations. We then present a practical model to fit the number of training rounds needed until convergence, and formulate an optimization problem to compute the optimal sparsification ratio. We conduct experiments on different DNN models in different network environments and compare various methods in terms of convergence speed and model quality. The experimental results show that ACE achieves up to 9.39× and 1.28× training speedups over fixed and state-of-the-art adaptive compression methods, respectively.
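The abstract outlines ACE's selection loop: average the measured bandwidth over a time window, model one iteration as computation plus compression plus communication, fit the number of rounds needed to converge as a function of the sparsification ratio, and pick the ratio that minimizes the estimated total training time. The sketch below is a minimal, hypothetical illustration of that step; the function names, cost-model constants, and the fitted rounds-to-convergence form are assumptions made for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch of the ratio-selection step described in the abstract.
# All constants and the fitted forms below are illustrative assumptions.
import numpy as np


def average_bandwidth(samples_mbps: list[float]) -> float:
    """Average the bandwidth measurements collected over the current window."""
    return float(np.mean(samples_mbps))


def iteration_time(ratio: float, grad_size_mb: float, compute_s: float,
                   bw_mbps: float, compress_coef_s: float = 0.05) -> float:
    """Model one iteration: computation + compression + communication.

    `ratio` is the fraction of gradients kept (0 < ratio <= 1). The compression
    cost is assumed to grow mildly with the kept fraction; in practice it would
    be profiled offline, as the paper describes.
    """
    comm_s = (ratio * grad_size_mb * 8.0) / bw_mbps       # transmit only the kept gradients
    compress_s = compress_coef_s * (1.0 + ratio)          # assumed profiled compression cost
    return compute_s + compress_s + comm_s


def rounds_to_converge(ratio: float, base_rounds: int = 10_000,
                       penalty: float = 0.3) -> float:
    """Assumed fitted model: sparser updates need more rounds to converge."""
    return base_rounds * (1.0 + penalty * np.log(1.0 / ratio))


def choose_ratio(bw_samples_mbps: list[float], grad_size_mb: float,
                 compute_s: float, candidates=None) -> float:
    """Pick the sparsification ratio minimizing estimated total training time."""
    candidates = candidates or [0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0]
    bw = average_bandwidth(bw_samples_mbps)
    return min(candidates,
               key=lambda r: rounds_to_converge(r)
               * iteration_time(r, grad_size_mb, compute_s, bw))


if __name__ == "__main__":
    # Example: ~100 MB of gradients, 0.2 s of computation, a fluctuating link.
    print(choose_ratio([800.0, 650.0, 720.0], grad_size_mb=100.0, compute_s=0.2))
```

Windowed averaging is what keeps the chosen ratio stable: the candidate search reacts to the mean bandwidth of the window rather than to every instantaneous fluctuation of the link.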
Pages: 14
Related Papers
50 records in total
  • [31] DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks. Wiedemann, Simon; Kirchhoffer, Heiner; Matlage, Stefan; Haase, Paul; Marban, Arturo; Marinc, Talmaj; Neumann, David; Nguyen, Tung; Schwarz, Heiko; Wiegand, Thomas; Marpe, Detlev; Samek, Wojciech. IEEE Journal of Selected Topics in Signal Processing, 2020, 14(4): 700-714.
  • [32] Efficient and Structural Gradient Compression with Principal Component Analysis for Distributed Training. Tan, Jiaxin; Yao, Chao; Guo, Zehua. Proceedings of the 7th Asia-Pacific Workshop on Networking (APNet 2023), 2023: 217-218.
  • [33] An Efficient Method for Training Deep Learning Networks Distributed. Wang, Chenxu; Lu, Yutong; Chen, Zhiguang; Li, Junnan. IEICE Transactions on Information and Systems, 2020, E103D(12): 2444-2456.
  • [34] Research and design of distributed training algorithm for neural networks. Yang, B.; Wang, Y. D.; Su, X. H. Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, Vols 1-9, 2005: 4044-4049.
  • [35] Speaker Adaptive Training Using Deep Neural Networks. Ochiai, Tsubasa; Matsuda, Shigeki; Lu, Xugang; Hori, Chiori; Katagiri, Shigeru. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
  • [36] Improvements to Speaker Adaptive Training of Deep Neural Networks. Miao, Yajie; Jiang, Lu; Zhang, Hao; Metze, Florian. 2014 IEEE Workshop on Spoken Language Technology (SLT 2014), 2014: 165-170.
  • [37] The adaptive fuzzy training algorithm for feedforward neural networks. Xie, P.; Liu, B. Xi Tong Gong Cheng Yu Dian Zi Ji Shu / Systems Engineering and Electronics, 2001, 23(7): 79-82.
  • [38] An Adaptive Gradient Method with Differentiation Element in Deep Neural Networks. Wang, Runqi; Wang, Wei; Ma, Teli; Zhang, Baochang. Proceedings of the 15th IEEE Conference on Industrial Electronics and Applications (ICIEA 2020), 2020: 1582-1587.
  • [39] Gradient Descent Analysis: On Visualizing the Training of Deep Neural Networks. Becker, Martin; Lippel, Jens; Zielke, Thomas. Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (IVAPP), Vol 3, 2019: 338-345.
  • [40] AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training. Chen, Chia-Yu; Choi, Jungwook; Brand, Daniel; Agrawal, Ankur; Zhang, Wei; Gopalakrishnan, Kailash. Thirty-Second AAAI Conference on Artificial Intelligence / Thirtieth Innovative Applications of Artificial Intelligence Conference / Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, 2018: 2827-2835.