A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning

Times Cited: 0
Authors
Shi, Shaohuai [1 ]
Wang, Qiang [1 ]
Chu, Xiaowen [1 ]
Li, Bo [2 ]
Affiliations
[1] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China
[2] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Keywords
Deep Learning; Graphics Processing Units; Stochastic Gradient Descent; NVLink; InfiniBand; Directed Acyclic Graph
DOI
10.1109/ICPADS.2018.00063
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline Classification Code
0812
Abstract
With huge amounts of training data, deep learning has made great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed on a cluster equipped with accelerators like GPUs. With the fast increase of GPU computing power, the data communications among GPUs have become a potential bottleneck on the overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which has been widely used in distributed deep learning frameworks. To understand the practical impact of data communications on training performance, we conduct extensive empirical studies on four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet and TensorFlow) over multi-GPU and multi-node environments with different data communication techniques, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify the potential bottlenecks and overheads that could be further optimized. Finally, we make the data set of our experimental traces publicly available, which could be used to support simulation-based studies.
Pages: 425-432
Number of Pages: 8
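The following is a minimal sketch (an illustration by the editor, not code from the paper) of the core idea in the abstract: one S-SGD iteration expressed as a DAG of per-layer forward, backward, and gradient all-reduce tasks, with the critical path giving an estimate of iteration time. All task names and per-layer costs are hypothetical, and the sketch ignores contention between tasks sharing the same GPU or network link, which a full model such as the paper's would need to account for.

```python
# Sketch: one synchronous SGD iteration as a DAG of tasks, scored by its
# critical path. Per-layer costs are made-up numbers in milliseconds.
from collections import defaultdict


def critical_path(tasks, deps):
    """Longest (finish-time) path through a DAG of task costs and dependencies."""
    finish = {}

    def fin(t):
        if t not in finish:
            finish[t] = tasks[t] + max((fin(p) for p in deps[t]), default=0.0)
        return finish[t]

    return max(fin(t) for t in tasks)


def ssgd_iteration_dag(fwd, bwd, comm, update_cost=1.0):
    """Build the DAG for one S-SGD iteration over len(fwd) layers.
    Gradient all-reduce of layer l may overlap with backward of layers l-1..0."""
    L = len(fwd)
    tasks, deps = {}, defaultdict(list)
    for l in range(L):
        tasks[f"F{l}"] = fwd[l]               # forward pass of layer l
        tasks[f"B{l}"] = bwd[l]               # backward pass of layer l
        tasks[f"C{l}"] = comm[l]              # all-reduce of layer l's gradients
        if l > 0:
            deps[f"F{l}"].append(f"F{l-1}")   # forward runs layer by layer
    for l in reversed(range(L)):
        prev = f"F{L-1}" if l == L - 1 else f"B{l+1}"
        deps[f"B{l}"].append(prev)            # backward runs from last layer to first
        deps[f"C{l}"].append(f"B{l}")         # gradients can be sent once computed
    tasks["U"] = update_cost                  # weight update waits for all all-reduces
    deps["U"] = [f"C{l}" for l in range(L)]
    return tasks, deps


if __name__ == "__main__":
    # Hypothetical per-layer costs (ms) for a 3-layer model.
    fwd, bwd, comm = [2, 3, 4], [4, 5, 6], [6, 4, 3]
    tasks, deps = ssgd_iteration_dag(fwd, bwd, comm)
    print(f"Estimated iteration time: {critical_path(tasks, deps):.1f} ms")
```

Overlapping communication with the remaining backward computation is what makes the DAG view useful here: the iteration time is bounded by the critical path rather than by the plain sum of compute and communication times.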