Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory

Cited by: 0
Authors
Beaumont, Olivier [1 ,2 ]
Eyraud-Dubois, Lionel [1 ,2 ]
Herrmann, Julien [3 ,4 ]
Joly, Alexis [5 ,6 ]
Shilova, Alena [7 ]
Affiliations
[1] Univ Bordeaux, Inria Ctr, Bordeaux, France
[2] LaBRI, Bordeaux, France
[3] CNRS, Paris, France
[4] IRIT, Toulouse, France
[5] Inria Sophia Antipolis Mediterranee, Montpellier, France
[6] Univ Montpellier, Montpellier, France
[7] Univ Lille, CNRS, Cent Lille, Inria,CRIStAL, Lille, France
Source
Keywords
Checkpointing; re-materialization; dynamic programming; convolutional neural networks; memory
DOI
10.1145/3648633
CLC classification
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Training feed-forward deep neural networks is a memory-intensive operation that is usually performed on GPUs with limited memory capacities. When the data does not fit in GPU memory, data scientists may be forced to limit the depth of their models or the resolution of the input data. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, limits the memory required to store intermediate data (activations) at the cost of additional computation. This paper introduces a new re-materialization strategy for activations that significantly reduces memory usage. It consists of selecting which activations are saved and which are deleted during the forward phase, and then recomputing the deleted activations when they are needed during the backward phase. We propose an original computation model that combines two types of activation savings: storing only the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, in which the computation time and the memory requirement of each layer may differ. We prove that finding the optimal solution is NP-hard and that classical techniques from the Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search for optimal solutions, no longer holds. We therefore propose a weak memory persistence property and provide a dynamic program to compute the optimal sequence of computations. This algorithm is made available through the Rotor software, a PyTorch plug-in that handles any network consisting of a sequence of layers, each with an arbitrarily complex structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes, and batch sizes.
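The abstract describes trading recomputation for memory: some activations are discarded during the forward pass and recomputed when the backward pass needs them. The sketch below is not the Rotor algorithm or its optimal schedule; it only illustrates the basic re-materialization idea using PyTorch's built-in torch.utils.checkpoint.checkpoint_sequential (recent versions accept use_reentrant=False), which splits a sequential chain into equal segments and keeps only the segment-boundary activations. The model, layer sizes, batch size, and segment count are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative feed-forward chain; depth and layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

x = torch.randn(64, 1024, requires_grad=True)

# Split the chain into 2 segments. Only activations at segment boundaries
# are kept during the forward pass; activations inside a segment are
# discarded and recomputed when backward() needs them.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
loss = out.sum()
loss.backward()  # triggers recomputation of the discarded activations
```

Whereas checkpoint_sequential applies a uniform segmentation, Rotor (as described above) selects a non-uniform, per-layer schedule via dynamic programming, using measured per-layer compute times and memory footprints, and combines the two kinds of activation savings mentioned in the abstract.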
Pages: 38
Related papers
7 records
  • [1] A Stochastic Modified Limited Memory BFGS for Training Deep Neural Networks
    Yousefi, Mahsa
    Calomardo, Angeles Martinez
    [J]. INTELLIGENT COMPUTING, VOL 2, 2022, 507 : 9 - 28
  • [2] How to Learn Quickly: An investigation of how to optimally train deep neural networks and its implications for human learning
    Rickard, Luke
    [J]. 2019 30TH IRISH SIGNALS AND SYSTEMS CONFERENCE (ISSC), 2019,
  • [3] Optimal market-making strategies under synchronised order arrivals with deep neural networks
    Choi, So Eun
    Jang, Hyun Jin
    Lee, Kyungsub
    Zheng, Harry
    [J]. JOURNAL OF ECONOMIC DYNAMICS & CONTROL, 2021, 125
  • [4] Multi-source heterogeneous information fusion fault diagnosis method based on deep neural networks under limited datasets
    Han, Dongying
    Zhang, Yu
    Yu, Yue
    Tian, Jinghui
    Shi, Peiming
    [J]. APPLIED SOFT COMPUTING, 2024, 154
  • [5] A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks
    Garofalo, Angelo
    Ottavi, Gianmarco
    Conti, Francesco
    Karunaratne, Geethan
    Boybat, Irem
    Benini, Luca
    Rossi, Davide
    [J]. IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2022, 12 (02) : 422 - 435
  • [6] A new efficient training strategy for deep neural networks by hybridization of artificial bee colony and limited-memory BFGS optimization algorithms
    Badem, Hasan
    Basturk, Alper
    Caliskan, Abdullah
    Yuksel, Mehmet Emin
    [J]. NEUROCOMPUTING, 2017, 266 : 506 - 526
  • [7] Optimal Design Methods to Transform 3D NAND Flash into a High-Density, High-Bandwidth and Low-Power Nonvolatile Computing in Memory (nvCIM) Accelerator for Deep-Learning Neural Networks (DNN)
    Lue, Hang-Ting
    Hsu, Po-Kai
    Wei, Ming-Liang
    Yeh, Teng-Hao
    Du, Pei-Ying
    Chen, Wei-Chen
    Wang, Keh-Chung
    Lu, Chih-Yuan
    [J]. 2019 IEEE INTERNATIONAL ELECTRON DEVICES MEETING (IEDM), 2019,