Layer-parallel training of residual networks with auxiliary variable networks

Cited by: 0
Authors
Sun, Qi [1 ,2 ]
Dong, Hexin [3 ]
Chen, Zewei [4 ]
Sun, Jiacheng [4 ]
Li, Zhenguo [4 ]
Dong, Bin [3 ,5 ]
Affiliations
[1] Tongji Univ, Sch Math Sci, 1239 Siping Rd, Shanghai, Peoples R China
[2] Tongji Univ, Key Lab Intelligent Comp & Applicat, Minist Educ, Shanghai, Peoples R China
[3] Peking Univ, Beijing Int Ctr Math Res, Beijing, Peoples R China
[4] Huawei Noah's Ark Lab, Shenzhen, Peoples R China
[5] Peking Univ, Ctr Data Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
auxiliary variable network; deep residual networks; optimal control of neural ordinary differential equations; penalty and augmented Lagrangian methods; synchronous layer-parallel training;
DOI
10.1002/num.23147
Chinese Library Classification (CLC)
O29 [Applied Mathematics];
Discipline Classification Code
070104;
Abstract
Gradient-based methods for training residual networks (ResNets) typically require a forward pass of the input data, followed by back-propagation of the error gradient to update model parameters, a procedure that becomes increasingly time-consuming as the network grows deeper. To break this algorithmic locking and exploit synchronous module parallelism in both the forward and backward modes, auxiliary-variable methods have emerged, but they suffer from communication overhead and a lack of data augmentation. By trading off the recomputation and storage of auxiliary variables, this work proposes a joint learning framework for training realistic ResNets across multiple compute devices. Specifically, the input data of each processor is generated from its low-capacity auxiliary network (AuxNet), which permits the use of data augmentation and realizes forward unlocking. The backward passes are then executed in parallel, each with a local loss function derived from the penalty or augmented Lagrangian (AL) method. Finally, the AuxNet is adjusted to reproduce the updated auxiliary variables through an end-to-end training process. We demonstrate the effectiveness of our method on ResNets and WideResNets across the CIFAR-10, CIFAR-100, and ImageNet datasets, achieving a speedup over the traditional layer-serial training approach while maintaining comparable test accuracy.
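As a concrete illustration of the two-phase procedure sketched in the abstract (parallel stage updates with penalty-coupled local losses, followed by refitting the auxiliary networks to the updated activations), the following is a minimal single-process PyTorch sketch of one training step under the penalty formulation. The three-stage split, the one-layer AuxNets, the penalty weight rho, and all module and variable names are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

K = 3          # number of parallel stages (hypothetical split of the ResNet)
rho = 1.0      # penalty weight coupling neighbouring stages (assumed value)

def make_stage(channels):
    # Toy stand-in for a group of residual blocks placed on one compute device.
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
    )

def make_auxnet(in_ch, out_ch):
    # Low-capacity AuxNet g_k: predicts the input activation of stage k
    # directly from the (augmented) image, which realizes forward unlocking.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

stem = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())      # owned by stage 0
stages = nn.ModuleList([make_stage(16) for _ in range(K)])
auxnets = nn.ModuleList([make_auxnet(3, 16) for _ in range(K - 1)])  # g_1 ... g_{K-1}
head = nn.Linear(16, 10)                                              # classifier head

stage_opt = torch.optim.SGD(
    list(stem.parameters()) + list(stages.parameters()) + list(head.parameters()), lr=0.1)
aux_opt = torch.optim.SGD(auxnets.parameters(), lr=0.1)

def train_step(x, y):
    # Phase 1: stage updates with local penalty losses. Each stage reads its
    # input from a frozen AuxNet prediction, so in a multi-device setting the
    # K forward/backward passes could run concurrently.
    with torch.no_grad():
        aux_in = [g(x) for g in auxnets]          # auxiliary inputs for stages 1..K-1
    loss = torch.zeros(())
    for k in range(K):
        inp = stem(x) if k == 0 else aux_in[k - 1]
        out = stages[k](inp)
        if k < K - 1:
            # Penalty term: stage-k output should match stage-(k+1) auxiliary input.
            loss = loss + rho * F.mse_loss(out, aux_in[k])
        else:
            logits = head(out.mean(dim=(2, 3)))   # global average pooling + linear head
            loss = loss + F.cross_entropy(logits, y)
    stage_opt.zero_grad(); loss.backward(); stage_opt.step()

    # Phase 2: refit the AuxNets end-to-end so that g_{k+1}(x) reproduces the
    # activation now produced by the updated stages (the updated auxiliary variables).
    with torch.no_grad():
        acts, h = [], stem(x)
        for k in range(K):
            h = stages[k](h)
            acts.append(h)
    aux_loss = sum(F.mse_loss(auxnets[k](x), acts[k]) for k in range(K - 1))
    aux_opt.zero_grad(); aux_loss.backward(); aux_opt.step()
    return loss.item()

# Toy usage on random data shaped like CIFAR-10 inputs.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
print(train_step(x, y))

Replacing the fixed penalty with an augmented Lagrangian term would add a multiplier update after Phase 1, and a distributed implementation would replace the Python loop over k with concurrent per-device workers; both variants are outside the scope of this sketch.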
Pages: 25