Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory

Cited by: 0
Authors
Beaumont, Olivier [1 ,2 ]
Eyraud-Dubois, Lionel [1 ,2 ]
Herrmann, Julien [3 ,4 ]
Joly, Alexis [5 ,6 ]
Shilova, Alena [7 ]
Affiliations
[1] Univ Bordeaux, Inria Ctr, Bordeaux, France
[2] LaBRI, Bordeaux, France
[3] CNRS, Paris, France
[4] IRIT, Toulouse, France
[5] Inria Sophia Antipolis Mediterranee, Montpellier, France
[6] Univ Montpellier, Montpellier, France
[7] Univ Lille, CNRS, Cent Lille, Inria,CRIStAL, Lille, France
Source
Keywords
Checkpointing; re-materialization; dynamic programming; convolutional neural networks; memory
DOI
10.1145/3648633
CLC classification
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Training feed-forward deep neural networks is a memory-intensive operation that is usually performed on GPUs with limited memory capacities. When the data does not fit in GPU memory, data scientists may be forced to limit the depth of their models or the resolution of the input data. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, limits the memory required to store intermediate data (activations) at the cost of additional computation. This paper introduces a new re-materialization strategy for activations that significantly reduces memory usage. It consists of selecting which activations are saved and which are deleted during the forward phase, and then recomputing the deleted activations when they are needed during the backward phase. We propose an original computation model that combines two types of activation savings: storing only the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, in which the computation time and the memory requirement of each layer may differ. We prove that finding the optimal solution is NP-hard and that classical techniques from the Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search for optimal solutions, no longer holds. We therefore propose a weak memory persistence property and provide a dynamic program to compute the optimal sequence of computations. This algorithm is made available through the Rotor software, a PyTorch plug-in that handles any network consisting of a sequence of layers, each with an arbitrarily complex structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes, and batch sizes.
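The abstract describes trading recomputation for memory: some activations are discarded during the forward pass and recomputed when the backward pass needs them. The sketch below is not the Rotor algorithm or its optimal schedule; it only illustrates the basic re-materialization idea using PyTorch's built-in torch.utils.checkpoint.checkpoint_sequential (recent versions accept use_reentrant=False), which splits a sequential chain into equal segments and keeps only the segment-boundary activations. The model, layer sizes, batch size, and segment count are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative feed-forward chain; depth and layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

x = torch.randn(64, 1024, requires_grad=True)

# Split the chain into 2 segments. Only activations at segment boundaries
# are kept during the forward pass; activations inside a segment are
# discarded and recomputed when backward() needs them.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
loss = out.sum()
loss.backward()  # triggers recomputation of the discarded activations
```

Whereas checkpoint_sequential applies a uniform segmentation, Rotor (as described above) selects a non-uniform, per-layer schedule via dynamic programming, using measured per-layer compute times and memory footprints, and combines the two kinds of activation savings mentioned in the abstract.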
Pages: 38
Related papers
7 records
  • [1] A Stochastic Modified Limited Memory BFGS for Training Deep Neural Networks
    Yousefi, Mahsa
    Calomardo, Angeles Martinez
    [J]. INTELLIGENT COMPUTING, VOL 2, 2022, 507 : 9 - 28
  • [2] How to Learn Quickly: An investigation of how to optimally train deep neural networks and its implications for human learning
    Rickard, Luke
    [J]. 2019 30TH IRISH SIGNALS AND SYSTEMS CONFERENCE (ISSC), 2019,
  • [3] Optimal market-making strategies under synchronised order arrivals with deep neural networks
    Choi, So Eun
    Jang, Hyun Jin
    Lee, Kyungsub
    Zheng, Harry
    [J]. JOURNAL OF ECONOMIC DYNAMICS & CONTROL, 2021, 125
  • [4] Multi-source heterogeneous information fusion fault diagnosis method based on deep neural networks under limited datasets
    Han, Dongying
    Zhang, Yu
    Yu, Yue
    Tian, Jinghui
    Shi, Peiming
    [J]. APPLIED SOFT COMPUTING, 2024, 154
  • [5] A Heterogeneous In-Memory Computing Cluster for Flexible End-to-End Inference of Real-World Deep Neural Networks
    Garofalo, Angelo
    Ottavi, Gianmarco
    Conti, Francesco
    Karunaratne, Geethan
    Boybat, Irem
    Benini, Luca
    Rossi, Davide
    [J]. IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2022, 12 (02) : 422 - 435
  • [6] A new efficient training strategy for deep neural networks by hybridization of artificial bee colony and limited-memory BFGS optimization algorithms
    Badem, Hasan
    Basturk, Alper
    Caliskan, Abdullah
    Yuksel, Mehmet Emin
    [J]. NEUROCOMPUTING, 2017, 266 : 506 - 526
  • [7] Optimal Design Methods to Transform 3D NAND Flash into a High-Density, High-Bandwidth and Low-Power Nonvolatile Computing in Memory (nvCIM) Accelerator for Deep-Learning Neural Networks (DNN)
    Lue, Hang-Ting
    Hsu, Po-Kai
    Wei, Ming-Liang
    Yeh, Teng-Hao
    Du, Pei-Ying
    Chen, Wei-Chen
    Wang, Keh-Chung
    Lu, Chih-Yuan
    [J]. 2019 IEEE INTERNATIONAL ELECTRON DEVICES MEETING (IEDM), 2019,