Layer-parallel training of residual networks with auxiliary variable networks

Cited by: 0
Authors
Sun, Qi [1 ,2 ]
Dong, Hexin [3 ]
Chen, Zewei [4 ]
Sun, Jiacheng [4 ]
Li, Zhenguo [4 ]
Dong, Bin [3 ,5 ]
Affiliations
[1] Tongji Univ, Sch Math Sci, 1239 Siping Rd, Shanghai, Peoples R China
[2] Tongji Univ, Key Lab Intelligent Comp & Applicat, Minist Educ, Shanghai, Peoples R China
[3] Peking Univ, Beijing Int Ctr Math Res, Beijing, Peoples R China
[4] Huawei Noah's Ark Lab, Shenzhen, Peoples R China
[5] Peking Univ, Ctr Data Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
auxiliary variable network; deep residual networks; optimal control of neural ordinary differential equations; penalty and augmented Lagrangian methods; synchronous layer-parallel training;
DOI
10.1002/num.23147
Chinese Library Classification (CLC)
O29 [Applied Mathematics];
Discipline Classification Code
070104;
Abstract
Gradient-based methods for training residual networks (ResNets) typically require a forward pass of the input data, followed by back-propagation of the error gradient to update model parameters, a procedure that becomes increasingly time-consuming as the network grows deeper. To break this algorithmic locking and exploit synchronous module parallelism in both the forward and backward modes, auxiliary-variable methods have emerged, but they suffer from communication overhead and a lack of data augmentation. By trading off the recomputation and storage of auxiliary variables, this work proposes a joint learning framework for training realistic ResNets across multiple compute devices. Specifically, the input data of each processor is generated from its low-capacity auxiliary network (AuxNet), which permits the use of data augmentation and realizes forward unlocking. The backward passes are then executed in parallel, each with a local loss function derived from the penalty or augmented Lagrangian (AL) method. Finally, the AuxNet is adjusted to reproduce the updated auxiliary variables through an end-to-end training process. We demonstrate the effectiveness of our method on ResNets and WideResNets across the CIFAR-10, CIFAR-100, and ImageNet datasets, achieving a speedup over the traditional layer-serial training approach while maintaining comparable test accuracy.
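As a concrete illustration of the two-phase procedure sketched in the abstract (parallel stage updates with penalty-coupled local losses, followed by refitting the auxiliary networks to the updated activations), the following is a minimal single-process PyTorch sketch of one training step under the penalty formulation. The three-stage split, the one-layer AuxNets, the penalty weight rho, and all module and variable names are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

K = 3          # number of parallel stages (hypothetical split of the ResNet)
rho = 1.0      # penalty weight coupling neighbouring stages (assumed value)

def make_stage(channels):
    # Toy stand-in for a group of residual blocks placed on one compute device.
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
    )

def make_auxnet(in_ch, out_ch):
    # Low-capacity AuxNet g_k: predicts the input activation of stage k
    # directly from the (augmented) image, which realizes forward unlocking.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

stem = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())      # owned by stage 0
stages = nn.ModuleList([make_stage(16) for _ in range(K)])
auxnets = nn.ModuleList([make_auxnet(3, 16) for _ in range(K - 1)])  # g_1 ... g_{K-1}
head = nn.Linear(16, 10)                                              # classifier head

stage_opt = torch.optim.SGD(
    list(stem.parameters()) + list(stages.parameters()) + list(head.parameters()), lr=0.1)
aux_opt = torch.optim.SGD(auxnets.parameters(), lr=0.1)

def train_step(x, y):
    # Phase 1: stage updates with local penalty losses. Each stage reads its
    # input from a frozen AuxNet prediction, so in a multi-device setting the
    # K forward/backward passes could run concurrently.
    with torch.no_grad():
        aux_in = [g(x) for g in auxnets]          # auxiliary inputs for stages 1..K-1
    loss = torch.zeros(())
    for k in range(K):
        inp = stem(x) if k == 0 else aux_in[k - 1]
        out = stages[k](inp)
        if k < K - 1:
            # Penalty term: stage-k output should match stage-(k+1) auxiliary input.
            loss = loss + rho * F.mse_loss(out, aux_in[k])
        else:
            logits = head(out.mean(dim=(2, 3)))   # global average pooling + linear head
            loss = loss + F.cross_entropy(logits, y)
    stage_opt.zero_grad(); loss.backward(); stage_opt.step()

    # Phase 2: refit the AuxNets end-to-end so that g_{k+1}(x) reproduces the
    # activation now produced by the updated stages (the updated auxiliary variables).
    with torch.no_grad():
        acts, h = [], stem(x)
        for k in range(K):
            h = stages[k](h)
            acts.append(h)
    aux_loss = sum(F.mse_loss(auxnets[k](x), acts[k]) for k in range(K - 1))
    aux_opt.zero_grad(); aux_loss.backward(); aux_opt.step()
    return loss.item()

# Toy usage on random data shaped like CIFAR-10 inputs.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
print(train_step(x, y))

Replacing the fixed penalty with an augmented Lagrangian term would add a multiplier update after Phase 1, and a distributed implementation would replace the Python loop over k with concurrent per-device workers; both variants are outside the scope of this sketch.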
Pages: 25