A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning

Cited: 0
Authors
Shi, Shaohuai [1 ]
Wang, Qiang [1 ]
Chu, Xiaowen [1 ]
Li, Bo [2 ]
Affiliations
[1] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China
[2] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Deep Learning; Graphics Processing Units; Stochastic Gradient Descent; NVLink; InfiniBand; Directed Acyclic Graph;
DOI
10.1109/ICPADS.2018.00063
CLC Classification Number
TP3 [Computing technology; computer technology]
Discipline Code
0812
Abstract
With huge amounts of training data, deep learning has made great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed over a cluster equipped with accelerators such as GPUs. With the rapid growth of GPU computing power, data communication among GPUs has become a potential bottleneck for overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which is widely used in distributed deep learning frameworks. To understand the practical impact of data communication on training performance, we conduct extensive empirical studies on four state-of-the-art distributed deep learning frameworks (Caffe-MPI, CNTK, MXNet, and TensorFlow) in multi-GPU and multi-node environments with different data communication technologies, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify potential bottlenecks and overheads that could be further optimized. Finally, we make our experimental traces publicly available to support simulation-based studies.
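The abstract's central idea is to represent one S-SGD iteration as a DAG of computation and communication tasks. A minimal sketch of this style of model, assuming a simple layer-wise dependency structure (the layer count and task names here are illustrative, not taken from the paper), using Python's standard-library `graphlib`:

```python
# Sketch: one S-SGD iteration as a DAG of tasks (hypothetical structure,
# not the authors' implementation). Each task maps to its prerequisites.
from graphlib import TopologicalSorter

L = 3  # hypothetical number of layers
dag = {}  # task name -> set of prerequisite task names

# Forward pass: layer l depends on layer l-1.
for l in range(L):
    dag[f"fwd{l}"] = {f"fwd{l-1}"} if l > 0 else set()

# Backward pass: runs in reverse layer order after the forward pass.
for l in reversed(range(L)):
    dag[f"bwd{l}"] = {f"bwd{l+1}"} if l < L - 1 else {f"fwd{L-1}"}
    # Gradient communication (e.g. an all-reduce) for layer l can start
    # as soon as its backward step finishes, so it may overlap with the
    # backward computation of lower layers.
    dag[f"comm{l}"] = {f"bwd{l}"}

# The synchronous weight update waits for all gradient communications.
dag["update"] = {f"comm{l}" for l in range(L)}

# Any valid topological order respects compute/communication dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because each `comm` task depends only on its own layer's backward step rather than on the whole backward pass, this DAG captures the computation-communication overlap that such a model can expose when estimating iteration time.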
Pages: 425-432
Page count: 8
Related Papers
50 items in total
  • [11] Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning: Examining Distributed and Centralized Stochastic Gradient Descent
    Pu, Shi
    Olshevsky, Alex
    Paschalidis, Ioannis Ch.
    [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2020, 37 (03) : 114 - 122
  • [12] Anytime Exploitation of Stragglers in Synchronous Stochastic Gradient Descent
    Ferdinand, Nuwan
    Gharachorloo, Benjamin
    Draper, Stark C.
    [J]. 2017 16TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2017, : 141 - 146
  • [13] Distributed Gradient Descent for Functional Learning
    Yu, Zhan
    Fan, Jun
    Shi, Zhongjie
    Zhou, Ding-Xuan
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 2024, 70 (09) : 6547 - 6571
  • [14] Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent
    Shen, Shuheng
    Xu, Linli
    Liu, Jingchang
    Liang, Xianfeng
    Cheng, Yifei
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 4582 - 4589
  • [15] Deep learning for sea cucumber detection using stochastic gradient descent algorithm
    Zhang, Huaqiang
    Yu, Fusheng
    Sun, Jincheng
    Shen, Xiaoqin
    Li, Kun
    [J]. EUROPEAN JOURNAL OF REMOTE SENSING, 2020, 53 : 53 - 62
  • [16] Communication-Efficient Local Stochastic Gradient Descent for Scalable Deep Learning
    Lee, Sunwoo
    Kang, Qiao
    Agrawal, Ankit
    Choudhary, Alok
    Liao, Wei-keng
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 718 - 727
  • [17] Annealed Gradient Descent for Deep Learning
    Pan, Hengyue
    Jiang, Hui
    [J]. UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2015, : 652 - 661
  • [18] Annealed gradient descent for deep learning
    Pan, Hengyue
    Niu, Xin
    Li, RongChun
    Dou, Yong
    Jiang, Hui
    [J]. NEUROCOMPUTING, 2020, 380 : 201 - 211
  • [19] Convergence analysis of distributed stochastic gradient descent with shuffling
    Meng, Qi
    Chen, Wei
    Wang, Yue
    Ma, Zhi-Ming
    Liu, Tie-Yan
    [J]. NEUROCOMPUTING, 2019, 337 : 46 - 57
  • [20] Distributed Stochastic Gradient Descent With Compressed and Skipped Communication
    Phuong, Tran Thi
    Phong, Le Trieu
    Fukushima, Kazuhide
    [J]. IEEE ACCESS, 2023, 11 : 99836 - 99846