DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning

Cited by: 6
Authors
Jansen, Matthijs [1 ,2 ]
Codreanu, Valeriu [1 ]
Varbanescu, Ana-Lucia [2 ]
Affiliations
[1] SURFsara, Amsterdam, Netherlands
[2] Univ Amsterdam, Amsterdam, Netherlands
DOI
10.1109/DLS51937.2020.00009
CLC Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Due to its many applications across various fields of research, engineering, and daily life, deep learning has seen a surge in popularity. Therefore, larger and more expressive models have been proposed, with examples like Turing-NLG using as many as 17 billion parameters. Training these very large models becomes increasingly difficult due to the high computational costs and large memory footprint. Therefore, several approaches for distributed training based on data parallelism (e.g., Horovod) and model/pipeline parallelism (e.g., GPipe, PipeDream) have emerged. In this work, we focus on an in-depth comparison of three different parallelism models that address these needs: data, model, and pipeline parallelism. To this end, we provide an analytical comparison of the three, both in terms of computation time and memory usage, and introduce DDLBench, a comprehensive (open-source, ready-to-use) benchmark suite to quantify these differences in practice. Through in-depth performance analysis and experimentation with various models, datasets, distribution models and hardware systems, we demonstrate that DDLBench can accurately quantify the capability of a given system to perform distributed deep learning (DDL). By comparing our analytical models with the benchmarking results, we show how the performance of real-life implementations diverges from these analytical models, thus requiring benchmarking to capture the in-depth complexity of the frameworks themselves.
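The data-parallel approach the abstract contrasts with model and pipeline parallelism can be illustrated with a minimal sketch (not from the paper): each worker computes the gradient of its own shard of the batch, and an all-reduce sum recovers the full-batch gradient. Because the loss gradient is a sum over samples, the sharded result matches single-worker training exactly. The toy linear model and the 4-worker split are illustrative assumptions.

```python
import numpy as np

# Toy linear model: loss = 0.5 * ||X @ w - y||^2, so grad = X.T @ (X @ w - y).
def grad(X, y, w):
    return X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)

# Data parallelism: split the batch across 4 simulated "workers"; each
# computes a local gradient on its shard only.
shards = np.array_split(np.arange(8), 4)
local_grads = [grad(X[idx], y[idx], w) for idx in shards]

# All-reduce (here simply a sum across workers) rebuilds the full gradient.
combined = np.sum(local_grads, axis=0)

# The sharded sum equals the full-batch gradient, term by term.
assert np.allclose(combined, grad(X, y, w))
```

In real frameworks such as Horovod, the sum above is performed by a ring all-reduce over the network, which is where the communication cost analyzed in the paper comes from; model and pipeline parallelism instead partition the network's layers across devices.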
Pages: 31-39
Page count: 9
Related Papers (50 total)
  • [1] Towards a Scalable and Distributed Infrastructure for Deep Learning Applications
    Hasheminezhad, Bita
    Shirzad, Shahrzad
    Wu, Nanmiao
    Diehl, Patrick
    Schulz, Hannes
    Kaiser, Hartmut
    PROCEEDINGS OF 2020 IEEE/ACM 5TH WORKSHOP ON DEEP LEARNING ON SUPERCOMPUTERS (DLS 2020), 2020, : 20 - 30
  • [2] The Design and Implementation of a Scalable Deep Learning Benchmarking Platform
    Li, Cheng
    Dakkak, Abdul
    Xiong, Jinjun
    Hwu, Wen-mei
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 414 - 425
  • [3] ScaDL 2022: Fourth IPDPS Workshop on Scalable Deep Learning over Parallel and Distributed Infrastructure
    Ardagna, Danilo
    Patterson, Stacy
    Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022, 2022,
  • [4] Performance Analysis of Distributed and Scalable Deep Learning
    Mahon, Sean
    Varrette, Sebastien
    Plugaru, Valentin
    Pinel, Frederic
    Bouvry, Pascal
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 760 - 766
  • [5] Benchmarking Resource Usage for Efficient Distributed Deep Learning
    Frey, Nathan C.
    Li, Baolin
    McDonald, Joseph
    Zhao, Dan
    Jones, Michael
    Bestor, David
    Tiwari, Devesh
    Gadepally, Vijay
    Samsi, Siddharth
    2022 IEEE HIGH PERFORMANCE EXTREME COMPUTING VIRTUAL CONFERENCE (HPEC), 2022,
  • [6] A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning
    Ben-Nun, Tal
    Besta, Maciej
    Huber, Simon
    Ziogas, Alexandros Nikolaos
    Peter, Daniel
    Hoefler, Torsten
    2019 IEEE 33RD INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2019), 2019, : 66 - 77
  • [7] A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning
    Akintoye, Samson B.
    Han, Liangxiu
    Zhang, Xin
    Chen, Haoming
    Zhang, Daoqiang
    IEEE ACCESS, 2022, 10 : 77950 - 77961
  • [8] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools
    Mayer, Ruben
    Jacobsen, Hans-Arno
    ACM COMPUTING SURVEYS, 2020, 53 (01)
  • [9] Hierarchical Heterogeneous Cluster Systems for Scalable Distributed Deep Learning
    Wang, Yibo
    Geng, Tongsheng
    Silva, Ericson
    Gaudiot, Jean-Luc
    2024 IEEE 27TH INTERNATIONAL SYMPOSIUM ON REAL-TIME DISTRIBUTED COMPUTING, ISORC 2024, 2024,
  • [10] Scalable Malware Detection System Using Distributed Deep Learning
    Kumar, Manish
    CYBERNETICS AND SYSTEMS, 2023, 54 (05) : 619 - 647