mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training

Cited by: 2
Authors
Dreuning, Henk [1,2]
Bal, Henri E. [2]
van Nieuwpoort, Rob V. [1,3]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Vrije Univ Amsterdam, Amsterdam, Netherlands
[3] Netherlands eSci Ctr, Amsterdam, Netherlands
Funding
Dutch Research Council;
Keywords
Deep Learning; Pipeline Parallelism; HPC;
DOI
10.1007/978-3-031-12597-3_10
CLC classification number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
Memory usage is becoming an increasingly pressing bottleneck in the training process of Deep Neural Networks (DNNs), especially when training on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training setups partition the neural network over the GPUs in a way that favors training throughput over memory usage, and thus maximum trainable network size. We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It evenly distributes Deep Learning models over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables training of neural networks that are 1.55 times larger than existing partitioning solutions in terms of the number of parameters.
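The balancing objective described in the abstract, distributing contiguous groups of layers over pipeline stages so that the largest per-stage peak memory is as small as possible, can be sketched with a generic min-max contiguous-partition routine. This is an illustrative sketch only, not mCAP's actual profiler or predictor, and the per-layer memory figures below are made up:

```python
def balance_partition(layer_mem, num_stages):
    """Split layers (kept contiguous, as required by pipeline parallelism)
    into num_stages stages so that the largest per-stage memory sum is
    minimized. Uses binary search on the peak plus a greedy feasibility check."""
    def stages_needed(cap):
        # Greedily pack layers into stages without exceeding cap;
        # return how many stages that takes.
        count, current = 1, 0
        for m in layer_mem:
            if m > cap:
                return float("inf")  # a single layer already exceeds the cap
            if current + m > cap:
                count += 1
                current = m
            else:
                current += m
        return count

    lo, hi = max(layer_mem), sum(layer_mem)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_stages:
            hi = mid  # feasible: try a lower peak
        else:
            lo = mid + 1
    return lo  # minimal achievable peak per-stage memory

# Hypothetical per-layer peak memory (MB) for an 8-layer model, 4 stages
mem = [300, 120, 500, 80, 240, 240, 160, 360]
print(balance_partition(mem, 4))  # → 560
```

A throughput-oriented partitioner would instead balance per-stage compute time, which is exactly the trade-off the paper targets: balancing memory rather than speed maximizes the trainable model size.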
Pages: 155-170
Page count: 16
Related papers
50 records
  • [1] Memory-Efficient Pipeline-Parallel DNN Training
    Narayanan, Deepak
    Phanishayee, Amar
    Shi, Kaiyu
    Chen, Xie
    Zaharia, Matei
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [2] CAPTURE: Memory-Centric Partitioning for Distributed DNN Training with Hybrid Parallelism
    Dreuning, Henk
    Verstoep, Kees
    Bal, Henri E.
    van Nieuwpoort, Rob V.
    [J]. 2023 IEEE 30TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS, HIPC 2023, 2023, : 76 - 86
  • [3] CAPSlog: Scalable Memory-Centric Partitioning for Pipeline Parallelism
    Dreuning, Henk
    Liokouras, Anna Badia
    Ouyang, Xiaowei
    Bal, Henri E.
    van Nieuwpoort, Rob V.
    [J]. 2024 32ND EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PDP 2024, 2024, : 17 - 25
  • [4] Visual Diagnostics of Parallel Performance in Training Large-Scale DNN Models
    Wei, Yating
    Wang, Zhiyong
    Wang, Zhongwei
    Dai, Yong
    Ou, Gongchang
    Gao, Han
    Yang, Haitao
    Wang, Yue
    Cao, Caleb Chen
    Weng, Luoxuan
    Lu, Jiaying
    Zhu, Rongchen
    Chen, Wei
    [J]. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (07) : 3915 - 3929
  • [5] Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
    Hong, Byungchul
    Ro, Yeonju
    Kim, John
    [J]. 2018 51ST ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO), 2018, : 682 - 695
  • [6] Swift: Expedited Failure Recovery for Large-Scale DNN Training
    Zhong, Yuchen
    Sheng, Guangming
    Liu, Juncheng
    Yuan, Jinhui
    Wu, Chuan
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35 (09) : 1644 - 1656
  • [7] A PARALLEL PARTITIONING METHOD FOR LARGE-SCALE CIRCUIT SIMULATION
    ZHANG, XD
    [J]. UNIVERSITY PROGRAMS IN COMPUTER-AIDED ENGINEERING, DESIGN, AND MANUFACTURING, 1989, : 134 - 141
  • [8] DistSim: A performance model of large-scale hybrid distributed DNN training
    Lu, Guandong
    Chen, Runzhe
    Wang, Yakai
    Zhou, Yangjie
    Zhang, Rui
    Hu, Zheng
    Miao, Yanming
    Cai, Zhifang
    Li, Li
    Leng, Jingwen
    Guo, Minyi
    [J]. PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2023, CF 2023, 2023, : 112 - 122
  • [9] GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training
    Sun, Peng
    Wen, Yonggang
    Han, Ruobing
    Feng, Wansen
    Yan, Shengen
    [J]. IEEE TRANSACTIONS ON BIG DATA, 2022, 8 (02) : 495 - 507
  • [10] Graph-Centric Performance Analysis for Large-Scale Parallel Applications
    Jin, Yuyang
    Wang, Haojie
    Zhong, Runxin
    Zhang, Chen
    Liao, Xia
    Zhang, Feng
    Zhai, Jidong
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35 (07) : 1221 - 1238