DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning

Cited by: 6

Authors:
Jansen, Matthijs [1 ,2 ]
Codreanu, Valeriu [1 ]
Varbanescu, Ana-Lucia [2 ]
Affiliations:
[1] SURFsara, Amsterdam, Netherlands
[2] Univ Amsterdam, Amsterdam, Netherlands
DOI: 10.1109/DLS51937.2020.00009
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Due to its many applications across various fields of research, engineering, and daily life, deep learning has seen a surge in popularity. Consequently, larger and more expressive models have been proposed, with examples like Turing-NLG using as many as 17 billion parameters. Training such very large models is increasingly difficult due to their high computational cost and large memory footprint. Several approaches for distributed training based on data parallelism (e.g., Horovod) and model/pipeline parallelism (e.g., GPipe, PipeDream) have therefore emerged. In this work, we focus on an in-depth comparison of three parallelism models that address these needs: data, model, and pipeline parallelism. To this end, we provide an analytical comparison of the three, in terms of both computation time and memory usage, and introduce DDLBench, a comprehensive, open-source, ready-to-use benchmark suite to quantify these differences in practice. Through in-depth performance analysis and experimentation with various models, datasets, distribution models, and hardware systems, we demonstrate that DDLBench can accurately quantify the capability of a given system to perform distributed deep learning (DDL). By comparing our analytical models with the benchmarking results, we show how the performance of real-life implementations diverges from these analytical models, thus requiring benchmarking to capture the in-depth complexity of the frameworks themselves.
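The abstract contrasts the three parallelism schemes analytically in terms of per-device computation time and memory. A minimal back-of-the-envelope sketch of such analytical models is shown below; the function names, the linear cost formulas, and the GPipe-style pipeline "bubble" term are illustrative assumptions for intuition, not the paper's actual equations.

```python
# Idealized per-device cost sketches for the three parallelism schemes.
# `params` is the model size (arbitrary units) and `batch_time` is the time
# one device would need to process a full batch; communication cost is ignored.

def data_parallel(params, batch_time, n_devices):
    """Each device holds a full model replica; the batch is sharded."""
    memory = params                   # full replica on every device
    compute = batch_time / n_devices  # each device sees 1/n of the batch
    return memory, compute

def model_parallel(params, batch_time, n_devices):
    """Layers are partitioned; only one partition is active at a time."""
    memory = params / n_devices       # each device stores one model slice
    compute = batch_time              # partitions run strictly in sequence
    return memory, compute

def pipeline_parallel(params, batch_time, n_devices, micro_batches):
    """Model parallelism plus micro-batching to overlap partition work."""
    memory = params / n_devices
    stage_time = batch_time / (n_devices * micro_batches)  # one micro-batch, one stage
    # GPipe-style fill/drain bubble: (n_devices - 1) extra micro-batch slots
    compute = (micro_batches + n_devices - 1) * stage_time
    return memory, compute
```

Even this toy model reproduces the qualitative trade-offs the paper benchmarks: data parallelism trades memory (a full replica per device) for compute speedup, plain model parallelism saves memory but gains no speedup, and pipelining recovers most of the speedup while keeping the memory savings. The paper's point is that real implementations diverge from such clean formulas, which is why empirical benchmarking is needed.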
Pages: 31-39 (9 pages)