DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning

Cited by: 6
Authors
Jansen, Matthijs [1 ,2 ]
Codreanu, Valeriu [1 ]
Varbanescu, Ana-Lucia [2 ]
Affiliations
[1] SURFsara, Amsterdam, Netherlands
[2] Univ Amsterdam, Amsterdam, Netherlands
DOI
10.1109/DLS51937.2020.00009
CLC Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Due to its many applications across various fields of research, engineering, and daily life, deep learning has seen a surge in popularity. Therefore, larger and more expressive models have been proposed, with examples like Turing-NLG using as many as 17 billion parameters. Training these very large models becomes increasingly difficult due to the high computational costs and large memory footprint. Therefore, several approaches for distributed training based on data parallelism (e.g., Horovod) and model/pipeline parallelism (e.g., GPipe, PipeDream) have emerged. In this work, we focus on an in-depth comparison of three different parallelism models that address these needs: data, model, and pipeline parallelism. To this end, we provide an analytical comparison of the three, both in terms of computation time and memory usage, and introduce DDLBench, a comprehensive (open-source, ready-to-use) benchmark suite to quantify these differences in practice. Through in-depth performance analysis and experimentation with various models, datasets, distribution models and hardware systems, we demonstrate that DDLBench can accurately quantify the capability of a given system to perform distributed deep learning (DDL). By comparing our analytical models with the benchmarking results, we show how the performance of real-life implementations diverges from these analytical models, thus requiring benchmarking to capture the in-depth complexity of the frameworks themselves.
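The data-parallel approach the abstract contrasts with model and pipeline parallelism can be illustrated with a minimal sketch (not from the paper): each worker computes the gradient of its own shard of the batch, and an all-reduce sum recovers the full-batch gradient. Because the loss gradient is a sum over samples, the sharded result matches single-worker training exactly. The toy linear model and the 4-worker split are illustrative assumptions.

```python
import numpy as np

# Toy linear model: loss = 0.5 * ||X @ w - y||^2, so grad = X.T @ (X @ w - y).
def grad(X, y, w):
    return X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)

# Data parallelism: split the batch across 4 simulated "workers"; each
# computes a local gradient on its shard only.
shards = np.array_split(np.arange(8), 4)
local_grads = [grad(X[idx], y[idx], w) for idx in shards]

# All-reduce (here simply a sum across workers) rebuilds the full gradient.
combined = np.sum(local_grads, axis=0)

# The sharded sum equals the full-batch gradient, term by term.
assert np.allclose(combined, grad(X, y, w))
```

In real frameworks such as Horovod, the sum above is performed by a ring all-reduce over the network, which is where the communication cost analyzed in the paper comes from; model and pipeline parallelism instead partition the network's layers across devices.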
Pages: 31-39
Page count: 9
Related Papers (50 total)
  • [1] Towards a Scalable and Distributed Infrastructure for Deep Learning Applications
    Hasheminezhad, Bita
    Shirzad, Shahrzad
    Wu, Nanmiao
    Diehl, Patrick
    Schulz, Hannes
    Kaiser, Hartmut
    PROCEEDINGS OF 2020 IEEE/ACM 5TH WORKSHOP ON DEEP LEARNING ON SUPERCOMPUTERS (DLS 2020), 2020, : 20 - 30
  • [2] The Design and Implementation of a Scalable Deep Learning Benchmarking Platform
    Li, Cheng
    Dakkak, Abdul
    Xiong, Jinjun
    Hwu, Wen-mei
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 414 - 425
  • [3] ScaDL 2022: Fourth IPDPS Workshop on Scalable Deep Learning over Parallel and Distributed Infrastructure
    Ardagna, Danilo
    Patterson, Stacy
    Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022, 2022,
  • [4] Performance Analysis of Distributed and Scalable Deep Learning
    Mahon, Sean
    Varrette, Sebastien
    Plugaru, Valentin
    Pinel, Frederic
    Bouvry, Pascal
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 760 - 766
  • [5] Benchmarking Resource Usage for Efficient Distributed Deep Learning
    Frey, Nathan C.
    Li, Baolin
    McDonald, Joseph
    Zhao, Dan
    Jones, Michael
    Bestor, David
    Tiwari, Devesh
    Gadepally, Vijay
    Samsi, Siddharth
    2022 IEEE HIGH PERFORMANCE EXTREME COMPUTING VIRTUAL CONFERENCE (HPEC), 2022,
  • [6] A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning
    Ben-Nun, Tal
    Besta, Maciej
    Huber, Simon
    Ziogas, Alexandros Nikolaos
    Peter, Daniel
    Hoefler, Torsten
    2019 IEEE 33RD INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2019), 2019, : 66 - 77
  • [7] A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning
    Akintoye, Samson B.
    Han, Liangxiu
    Zhang, Xin
    Chen, Haoming
    Zhang, Daoqiang
    IEEE ACCESS, 2022, 10 : 77950 - 77961
  • [8] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools
    Mayer, Ruben
    Jacobsen, Hans-Arno
    ACM COMPUTING SURVEYS, 2020, 53 (01)
  • [9] Hierarchical Heterogeneous Cluster Systems for Scalable Distributed Deep Learning
    Wang, Yibo
    Geng, Tongsheng
    Silva, Ericson
    Gaudiot, Jean-Luc
    2024 IEEE 27TH INTERNATIONAL SYMPOSIUM ON REAL-TIME DISTRIBUTED COMPUTING, ISORC 2024, 2024,
  • [10] Scalable Malware Detection System Using Distributed Deep Learning
    Kumar, Manish
    CYBERNETICS AND SYSTEMS, 2023, 54 (05) : 619 - 647