Efficient Training of Large-Scale Neural Networks Using Linear Pipeline Broadcast

Cited by: 0
Authors
[1] University of Science and Technology, Department of Big Data Science, Daejeon 34112, Republic of Korea
[2] Affiliation not specified, 34141, Republic of Korea
[3] Affiliation not specified, 34112, Republic of Korea
DOI: 10.1109/ACCESS.2024.3492314
Abstract
Recently, deep learning models have been adopted in more domains and for more tasks, and the number of layers and parameters needed to reach the required performance has grown accordingly. The memory required for model training has therefore increased, driving the adoption and study of distributed training. Model parallelism techniques generally demand a large amount of memory during distributed training. Among them, layer pipelining, which divides the model into layers and places the resulting stages on separate devices, has attracted interest. Activation recomputation is a popular method for exploiting pipeline parallelism while minimizing memory consumption; however, its redundant forward operations can reduce training throughput. This study therefore introduces a forward propagation technique that employs a linear pipeline broadcast to decrease memory consumption while mitigating the throughput loss incurred by partially integrating recomputation into PipeDream-Flush. The proposed broadcast-based forward propagation offsets the overhead of activation recomputation by optimizing network communication between pipeline stages and reducing bubbles in the warm-up phase of the pipeline. Experimental results demonstrate that, at peak training throughput for GPT2, the proposed technique reduces memory consumption by approximately 36.0% compared with PipeDream-Flush, without a significant decrease in training throughput. Compared with PipeDream-Flush, the proposed method also achieved 14.6% and 12.6% higher peak training throughput for the ResNet152 and VGG19 models, respectively, while consuming 30.1% and 12.0% less memory.
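To make the two ideas combined in the abstract concrete, the sketch below shows one pipeline stage that recomputes its internal activations with torch.utils.checkpoint and relays its output to the next stage over point-to-point sends, so the stages form a linear chain. It is a minimal sketch, not the authors' implementation of broadcast-based forward propagation: it assumes a torch.distributed process group in which rank i hosts stage i, assumes all inter-stage tensors share the micro-batch's shape, and the names run_stage and stage_module are hypothetical.

# Minimal sketch of one pipeline stage: (a) activation recomputation inside the
# stage to save memory, and (b) a chained point-to-point send so activations flow
# stage-to-stage like a linear pipeline broadcast. Illustrative only; assumes
# rank i hosts stage i and every inter-stage tensor has the micro-batch's shape.
import torch
import torch.distributed as dist
from torch.utils.checkpoint import checkpoint

def run_stage(stage_module, micro_batch, rank, world_size, device="cuda"):
    if rank == 0:
        x = micro_batch.to(device)                      # first stage reads the real input
    else:
        x = torch.empty_like(micro_batch, device=device)
        dist.recv(x, src=rank - 1)                      # receive activations from the previous stage
        x.requires_grad_(True)                          # allow gradients to flow back to this boundary

    # Activation recomputation: intermediate activations inside the stage are
    # discarded during the forward pass and rebuilt during backward.
    y = checkpoint(stage_module, x, use_reentrant=False)

    if rank < world_size - 1:
        dist.send(y.detach(), dst=rank + 1)             # relay to the next stage (linear chain)
    return y

In a PipeDream-Flush-style schedule this send/receive chain is repeated once per micro-batch; the paper's contribution targets the cost of that inter-stage communication and the warm-up bubbles around it.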
Pages: 165653-165662