mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training

Cited by: 2
Authors
Dreuning, Henk [1,2]
Bal, Henri E. [2]
van Nieuwpoort, Rob V. [1,3]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
[2] Vrije Univ Amsterdam, Amsterdam, Netherlands
[3] Netherlands eSci Ctr, Amsterdam, Netherlands
Funding
Dutch Research Council
Keywords
Deep Learning; Pipeline Parallelism; HPC
DOI
10.1007/978-3-031-12597-3_10
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Memory usage is becoming an increasingly pressing bottleneck when training Deep Neural Networks (DNNs), especially on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training partition the neural network over the GPUs in a way that favors training throughput over memory usage, and therefore over the maximum trainable network size. We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It distributes Deep Learning models evenly over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage across devices. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables the training of neural networks with 1.55 times more parameters than existing partitioning solutions allow.
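The abstract describes a two-step approach: incremental profiling of per-layer memory usage, followed by a model-based predictor that recommends a memory-balanced partitioning. As a rough illustration only, the Python sketch below solves the simplified core problem of assigning contiguous layers to pipeline stages so that the maximum per-stage memory is minimized. The function name, the assumption that per-layer peak memory is additive within a stage, and the binary-search formulation are all illustrative and not taken from the paper; mCAP's actual predictor is model-based and more detailed.

```python
# Hypothetical sketch: balance profiled per-layer peak memory over pipeline
# stages. Assumes per-layer memory is additive within a stage; this is a
# simplification of mCAP's model-based prediction, not the paper's algorithm.

from typing import List

def stage_assignment(layer_mem: List[float], num_stages: int) -> List[List[int]]:
    """Split contiguous layers into at most num_stages groups, minimizing
    the maximum per-stage memory sum (binary search on the memory bound,
    with a greedy feasibility check)."""
    lo, hi = max(layer_mem), sum(layer_mem)
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        # Count the stages needed if no stage may exceed `mid`.
        stages, cur = 1, 0.0
        for m in layer_mem:
            if cur + m > mid:
                stages, cur = stages + 1, m
            else:
                cur += m
        if stages <= num_stages:
            hi = mid  # feasible bound: try to tighten it
        else:
            lo = mid  # infeasible bound: relax it
    # Rebuild the layer-to-stage assignment for the feasible bound `hi`.
    # (For simplicity this may return fewer than num_stages groups.)
    groups, cur_group, cur = [], [], 0.0
    for i, m in enumerate(layer_mem):
        if cur + m > hi and cur_group:
            groups.append(cur_group)
            cur_group, cur = [], 0.0
        cur_group.append(i)
        cur += m
    groups.append(cur_group)
    return groups

# Example: profiled per-layer peak memory (GB) for an 8-layer model, 3 GPUs.
print(stage_assignment([1.2, 0.8, 2.0, 1.5, 0.5, 1.0, 2.2, 0.9], 3))
# -> [[0, 1, 2], [3, 4, 5], [6, 7]] with per-stage sums 4.0, 3.0, 3.1
```

Note that this sketch optimizes only the memory balance; the throughput-oriented partitioners the abstract contrasts against would instead balance per-stage compute time.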
Pages: 155-170 (16 pages)
Related Papers
50 records in total
  • [41] A Parallel Solution of Large-Scale Heat Equation Based on Distributed Memory Hierarchy System
    Cheng, Tangpei
    Wang, Qun
    Ji, Xiaohui
    Li, Dandan
    [J]. ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, PT 2, PROCEEDINGS, 2010, 6082 : 413 - 421
  • [42] A Bi-layered Parallel Training Architecture for Large-Scale Convolutional Neural Networks
    Chen, Jianguo
    Li, Kenli
    Bilal, Kashif
    Zhou, Xu
    Li, Keqin
    Yu, Philip S.
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (05) : 965 - 976
  • [43] ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
    Zhao, Bo
    Zhou, Hucheng
    Li, Guoqiang
    Huang, Yihua
    [J]. BIG DATA MINING AND ANALYTICS, 2018, 1 (01): : 57 - 74
  • [45] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
    Li, Shenggui
    Liu, Hongxin
    Bian, Zhengda
    Fang, Jiarui
    Huang, Haichen
    Liu, Yuliang
    Wang, Boxiang
    You, Yang
    [J]. PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023, : 766 - 775
  • [46] Parallel clustering method for Non-Disjoint Partitioning of Large-Scale Data based on Spark Framework
    Zayani, Abir
    Ben N'Cir, Chiheb-Eddine
    Essoussi, Nadia
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 1064 - 1069
  • [47] JSweep: A Patch-centric Data-driven Approach for Parallel Sweeps on Large-scale Meshes
    Yan, Jie
    Yang, Zhang
    Zhang, Aiqing
    Mo, Zeyao
    [J]. PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023, : 776 - 785
  • [48] Intensive Working Memory Training Produces Functional Changes in Large-scale Frontoparietal Networks
    Thompson, Todd W.
    Waskom, Michael L.
    Gabrieli, John D. E.
    [J]. JOURNAL OF COGNITIVE NEUROSCIENCE, 2016, 28 (04) : 575 - 588
  • [49] Dynamic memory usage in parallel simulation: A case study of a large-scale military logistics application
    Booth, CJM
    Bruce, DI
    Hoare, PR
    Kirton, MJ
    Milner, KR
    Relf, IJ
    [J]. 1996 WINTER SIMULATION CONFERENCE PROCEEDINGS, 1996, : 975 - 982
  • [50] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
    Wang, Wei
    Lai, Zhiquan
    Li, Shengwei
    Liu, Weijie
    Ge, Keshi
    Liu, Yujie
    Shen, Ao
    Li, Dongsheng
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, CLUSTER, 2023, : 82 - 94