A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning

Cited: 0
Authors
Shi, Shaohuai [1 ]
Wang, Qiang [1 ]
Chu, Xiaowen [1 ]
Li, Bo [2 ]
Affiliations
[1] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China
[2] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Deep Learning; Graphics Processing Units; Stochastic Gradient Descent; NVLink; InfiniBand; Directed Acyclic Graph;
DOI
10.1109/ICPADS.2018.00063
CLC Classification Number
TP3 [Computing technology; computer technology]
Discipline Code
0812
Abstract
With huge amounts of training data, deep learning has made great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed over a cluster equipped with accelerators such as GPUs. With the rapid growth of GPU computing power, data communication among GPUs has become a potential bottleneck for overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which is widely used in distributed deep learning frameworks. To understand the practical impact of data communication on training performance, we conduct extensive empirical studies on four state-of-the-art distributed deep learning frameworks (Caffe-MPI, CNTK, MXNet, and TensorFlow) in multi-GPU and multi-node environments with different data communication technologies, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify potential bottlenecks and overheads that could be further optimized. Finally, we make our experimental traces publicly available to support simulation-based studies.
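The abstract's central idea is to represent one S-SGD iteration as a DAG of computation and communication tasks. A minimal sketch of this style of model, assuming a simple layer-wise dependency structure (the layer count and task names here are illustrative, not taken from the paper), using Python's standard-library `graphlib`:

```python
# Sketch: one S-SGD iteration as a DAG of tasks (hypothetical structure,
# not the authors' implementation). Each task maps to its prerequisites.
from graphlib import TopologicalSorter

L = 3  # hypothetical number of layers
dag = {}  # task name -> set of prerequisite task names

# Forward pass: layer l depends on layer l-1.
for l in range(L):
    dag[f"fwd{l}"] = {f"fwd{l-1}"} if l > 0 else set()

# Backward pass: runs in reverse layer order after the forward pass.
for l in reversed(range(L)):
    dag[f"bwd{l}"] = {f"bwd{l+1}"} if l < L - 1 else {f"fwd{L-1}"}
    # Gradient communication (e.g. an all-reduce) for layer l can start
    # as soon as its backward step finishes, so it may overlap with the
    # backward computation of lower layers.
    dag[f"comm{l}"] = {f"bwd{l}"}

# The synchronous weight update waits for all gradient communications.
dag["update"] = {f"comm{l}" for l in range(L)}

# Any valid topological order respects compute/communication dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because each `comm` task depends only on its own layer's backward step rather than on the whole backward pass, this DAG captures the computation-communication overlap that such a model can expose when estimating iteration time.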
Pages: 425-432
Page count: 8
Related Papers
50 items in total
  • [11] Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning: Examining Distributed and Centralized Stochastic Gradient Descent
    Pu, Shi
    Olshevsky, Alex
    Paschalidis, Ioannis Ch.
    [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2020, 37 (03) : 114 - 122
  • [12] Anytime Exploitation of Stragglers in Synchronous Stochastic Gradient Descent
    Ferdinand, Nuwan
    Gharachorloo, Benjamin
    Draper, Stark C.
    [J]. 2017 16TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2017, : 141 - 146
  • [13] Distributed Gradient Descent for Functional Learning
    Yu, Zhan
    Fan, Jun
    Shi, Zhongjie
    Zhou, Ding-Xuan
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 2024, 70 (09) : 6547 - 6571
  • [14] Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent
    Shen, Shuheng
    Xu, Linli
    Liu, Jingchang
    Liang, Xianfeng
    Cheng, Yifei
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 4582 - 4589
  • [15] Deep learning for sea cucumber detection using stochastic gradient descent algorithm
    Zhang, Huaqiang
    Yu, Fusheng
    Sun, Jincheng
    Shen, Xiaoqin
    Li, Kun
    [J]. EUROPEAN JOURNAL OF REMOTE SENSING, 2020, 53 : 53 - 62
  • [16] Communication-Efficient Local Stochastic Gradient Descent for Scalable Deep Learning
    Lee, Sunwoo
    Kang, Qiao
    Agrawal, Ankit
    Choudhary, Alok
    Liao, Wei-keng
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 718 - 727
  • [17] Annealed Gradient Descent for Deep Learning
    Pan, Hengyue
    Jiang, Hui
    [J]. UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2015, : 652 - 661
  • [18] Annealed gradient descent for deep learning
    Pan, Hengyue
    Niu, Xin
    Li, RongChun
    Dou, Yong
    Jiang, Hui
    [J]. NEUROCOMPUTING, 2020, 380 : 201 - 211
  • [19] Convergence analysis of distributed stochastic gradient descent with shuffling
    Meng, Qi
    Chen, Wei
    Wang, Yue
    Ma, Zhi-Ming
    Liu, Tie-Yan
    [J]. NEUROCOMPUTING, 2019, 337 : 46 - 57
  • [20] Distributed Stochastic Gradient Descent With Compressed and Skipped Communication
    Phuong, Tran Thi
    Phong, Le Trieu
    Fukushima, Kazuhide
    [J]. IEEE ACCESS, 2023, 11 : 99836 - 99846