Predicting Throughput of Distributed Stochastic Gradient Descent

Cited by: 0
Authors
Li, Zhuojin [1]
Paolieri, Marco [1]
Golubchik, Leana [1]
Lin, Sung-Han [2]
Yan, Wumo [1]
Affiliations
[1] Univ Southern Calif, Dept Comp Sci, Los Angeles, CA 90089 USA
[2] Meta, Menlo Pk, CA 94025 USA
Keywords
Computational modeling; Predictive models; Training; Throughput; Servers; Computer architecture; Uplink; Distributed machine learning; stochastic gradient descent; performance prediction; scalability; PyTorch
DOI
10.1109/TPDS.2022.3151739
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
Training jobs of deep neural networks (DNNs) can be accelerated through distributed variants of stochastic gradient descent (SGD), where multiple nodes process training examples and exchange updates. The total throughput of the nodes depends not only on their computing power, but also on their networking speeds and coordination mechanism (synchronous or asynchronous, centralized or decentralized), since communication bottlenecks and stragglers can result in sublinear scaling when additional nodes are provisioned. In this paper, we propose two classes of performance models to predict throughput of distributed SGD: fine-grained models, representing many elementary computation/communication operations and their dependencies; and coarse-grained models, where SGD steps at each node are represented as a sequence of high-level phases without parallelism between computation and communication. Using a PyTorch implementation, real-world DNN models and different cloud environments, our experimental evaluation illustrates that, while fine-grained models are more accurate and can be easily adapted to new variants of distributed SGD, coarse-grained models can provide similarly accurate predictions when augmented with ad hoc heuristics, and their parameters can be estimated with profiling information that is easier to collect.
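As a rough illustration of the coarse-grained approach described in the abstract, the Python sketch below models one step of synchronous, centralized data-parallel SGD as a sequence of non-overlapping phases (forward pass, backward pass, gradient exchange) and derives aggregate throughput from the per-step time. The ring all-reduce cost formula, the profiled phase durations, and the bandwidth figure are illustrative assumptions for this sketch, not the paper's fitted model or parameters.

# A minimal sketch of a coarse-grained throughput model for synchronous,
# centralized data-parallel SGD. The phase decomposition follows the
# coarse-grained idea in the abstract (no overlap between computation and
# communication); the specific cost formulas and numbers below are
# illustrative assumptions, not the authors' calibrated model.

def allreduce_seconds(grad_bytes: float, n_workers: int, bandwidth_bps: float) -> float:
    """Communication phase: a standard ring all-reduce transfers
    2 * (n - 1) / n of the gradient volume per worker."""
    if n_workers == 1:
        return 0.0
    volume_bytes = 2.0 * (n_workers - 1) / n_workers * grad_bytes
    return volume_bytes * 8.0 / bandwidth_bps  # bytes -> bits, divided by link speed

def step_seconds(fwd_s: float, bwd_s: float, grad_bytes: float,
                 n_workers: int, bandwidth_bps: float) -> float:
    """Coarse-grained model: one SGD step is a sequence of high-level
    phases, with no parallelism between computation and communication."""
    return fwd_s + bwd_s + allreduce_seconds(grad_bytes, n_workers, bandwidth_bps)

def throughput(batch_per_worker: int, fwd_s: float, bwd_s: float,
               grad_bytes: float, n_workers: int, bandwidth_bps: float) -> float:
    """Aggregate training examples processed per second across all workers."""
    t = step_seconds(fwd_s, bwd_s, grad_bytes, n_workers, bandwidth_bps)
    return n_workers * batch_per_worker / t

# Example with a hypothetical profile: ~100 MB of gradients
# (roughly ResNet-50 sized) on 10 Gb/s links.
if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        ex_per_s = throughput(batch_per_worker=32, fwd_s=0.040, bwd_s=0.080,
                              grad_bytes=100e6, n_workers=n, bandwidth_bps=10e9)
        print(f"{n:2d} workers: {ex_per_s:8.1f} examples/s")

Even this simple model reproduces the sublinear scaling the abstract discusses: as workers are added, per-worker compute time stays fixed while the all-reduce term grows toward a bandwidth-bound limit, so aggregate throughput flattens.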
Pages: 2900-2912
Number of pages: 13