mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training

Cited by: 2
Authors
Dreuning, Henk [1,2]
Bal, Henri E. [2]
van Nieuwpoort, Rob V. [1,3]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Vrije Univ Amsterdam, Amsterdam, Netherlands
[3] Netherlands eSci Ctr, Amsterdam, Netherlands
Funding
Dutch Research Council
Keywords
Deep Learning; Pipeline Parallelism; HPC
DOI
10.1007/978-3-031-12597-3_10
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Memory usage is becoming an increasingly pressing bottleneck when training Deep Neural Networks (DNNs), especially on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training partition the neural network over the GPUs in a way that favors training throughput over memory usage, and therefore over the maximum trainable network size. We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It distributes Deep Learning models evenly over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage across devices. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables the training of neural networks with 1.55 times more parameters than existing partitioning solutions allow.
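The abstract describes a two-step approach: incremental profiling of per-layer memory usage, followed by a model-based predictor that recommends a memory-balanced partitioning. As a rough illustration only, the Python sketch below solves the simplified core problem of assigning contiguous layers to pipeline stages so that the maximum per-stage memory is minimized. The function name, the assumption that per-layer peak memory is additive within a stage, and the binary-search formulation are all illustrative and not taken from the paper; mCAP's actual predictor is model-based and more detailed.

```python
# Hypothetical sketch: balance profiled per-layer peak memory over pipeline
# stages. Assumes per-layer memory is additive within a stage; this is a
# simplification of mCAP's model-based prediction, not the paper's algorithm.

from typing import List

def stage_assignment(layer_mem: List[float], num_stages: int) -> List[List[int]]:
    """Split contiguous layers into at most num_stages groups, minimizing
    the maximum per-stage memory sum (binary search on the memory bound,
    with a greedy feasibility check)."""
    lo, hi = max(layer_mem), sum(layer_mem)
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        # Count the stages needed if no stage may exceed `mid`.
        stages, cur = 1, 0.0
        for m in layer_mem:
            if cur + m > mid:
                stages, cur = stages + 1, m
            else:
                cur += m
        if stages <= num_stages:
            hi = mid  # feasible bound: try to tighten it
        else:
            lo = mid  # infeasible bound: relax it
    # Rebuild the layer-to-stage assignment for the feasible bound `hi`.
    # (For simplicity this may return fewer than num_stages groups.)
    groups, cur_group, cur = [], [], 0.0
    for i, m in enumerate(layer_mem):
        if cur + m > hi and cur_group:
            groups.append(cur_group)
            cur_group, cur = [], 0.0
        cur_group.append(i)
        cur += m
    groups.append(cur_group)
    return groups

# Example: profiled per-layer peak memory (GB) for an 8-layer model, 3 GPUs.
print(stage_assignment([1.2, 0.8, 2.0, 1.5, 0.5, 1.0, 2.2, 0.9], 3))
# -> [[0, 1, 2], [3, 4, 5], [6, 7]] with per-stage sums 4.0, 3.0, 3.1
```

Note that this sketch optimizes only the memory balance; the throughput-oriented partitioners the abstract contrasts against would instead balance per-stage compute time.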
Pages: 155-170 (16 pages)
Related Papers
50 records in total
  • [41] A Parallel Solution of Large-Scale Heat Equation Based on Distributed Memory Hierarchy System
    Cheng, Tangpei
    Wang, Qun
    Ji, Xiaohui
    Li, Dandan
    [J]. ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, PT 2, PROCEEDINGS, 2010, 6082 : 413 - 421
  • [42] A Bi-layered Parallel Training Architecture for Large-Scale Convolutional Neural Networks
    Chen, Jianguo
    Li, Kenli
    Bilal, Kashif
    Zhou, Xu
    Li, Keqin
    Yu, Philip S.
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (05) : 965 - 976
  • [43] ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
    Zhao, Bo
    Zhou, Hucheng
    Li, Guoqiang
    Huang, Yihua
    [J]. BIG DATA MINING AND ANALYTICS, 2018, 1 (01): : 57 - 74
  • [45] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
    Li, Shenggui
    Liu, Hongxin
    Bian, Zhengda
    Fang, Jiarui
    Huang, Haichen
    Liu, Yuliang
    Wang, Boxiang
    You, Yang
    [J]. PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023, : 766 - 775
  • [46] Parallel clustering method for Non-Disjoint Partitioning of Large-Scale Data based on Spark Framework
    Zayani, Abir
    Ben N'Cir, Chiheb-Eddine
    Essoussi, Nadia
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 1064 - 1069
  • [47] JSweep: A Patch-centric Data-driven Approach for Parallel Sweeps on Large-scale Meshes
    Yan, Jie
    Yang, Zhang
    Zhang, Aiqing
    Mo, Zeyao
    [J]. PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023, : 776 - 785
  • [48] Intensive Working Memory Training Produces Functional Changes in Large-scale Frontoparietal Networks
    Thompson, Todd W.
    Waskom, Michael L.
    Gabrieli, John D. E.
    [J]. JOURNAL OF COGNITIVE NEUROSCIENCE, 2016, 28 (04) : 575 - 588
  • [49] Dynamic memory usage in parallel simulation: A case study of a large-scale military logistics application
    Booth, CJM
    Bruce, DI
    Hoare, PR
    Kirton, MJ
    Milner, KR
    Relf, IJ
    [J]. 1996 WINTER SIMULATION CONFERENCE PROCEEDINGS, 1996, : 975 - 982
  • [50] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
    Wang, Wei
    Lai, Zhiquan
    Li, Shengwei
    Liu, Weijie
    Ge, Keshi
    Liu, Yujie
    Shen, Ao
    Li, Dongsheng
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, CLUSTER, 2023, : 82 - 94