DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning

Cited by: 6

Authors:
Jansen, Matthijs [1 ,2 ]
Codreanu, Valeriu [1 ]
Varbanescu, Ana-Lucia [2 ]
Affiliations:
[1] SURFsara, Amsterdam, Netherlands
[2] Univ Amsterdam, Amsterdam, Netherlands
DOI: 10.1109/DLS51937.2020.00009
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Due to its many applications across various fields of research, engineering, and daily life, deep learning has seen a surge in popularity. Consequently, larger and more expressive models have been proposed, with examples like Turing-NLG using as many as 17 billion parameters. Training such very large models is increasingly difficult due to their high computational cost and large memory footprint. Several approaches for distributed training based on data parallelism (e.g., Horovod) and model/pipeline parallelism (e.g., GPipe, PipeDream) have therefore emerged. In this work, we focus on an in-depth comparison of three parallelism models that address these needs: data, model, and pipeline parallelism. To this end, we provide an analytical comparison of the three, in terms of both computation time and memory usage, and introduce DDLBench, a comprehensive, open-source, ready-to-use benchmark suite to quantify these differences in practice. Through in-depth performance analysis and experimentation with various models, datasets, distribution models, and hardware systems, we demonstrate that DDLBench can accurately quantify the capability of a given system to perform distributed deep learning (DDL). By comparing our analytical models with the benchmarking results, we show how the performance of real-life implementations diverges from these analytical models, thus requiring benchmarking to capture the in-depth complexity of the frameworks themselves.
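The abstract contrasts the three parallelism schemes analytically in terms of per-device computation time and memory. A minimal back-of-the-envelope sketch of such analytical models is shown below; the function names, the linear cost formulas, and the GPipe-style pipeline "bubble" term are illustrative assumptions for intuition, not the paper's actual equations.

```python
# Idealized per-device cost sketches for the three parallelism schemes.
# `params` is the model size (arbitrary units) and `batch_time` is the time
# one device would need to process a full batch; communication cost is ignored.

def data_parallel(params, batch_time, n_devices):
    """Each device holds a full model replica; the batch is sharded."""
    memory = params                   # full replica on every device
    compute = batch_time / n_devices  # each device sees 1/n of the batch
    return memory, compute

def model_parallel(params, batch_time, n_devices):
    """Layers are partitioned; only one partition is active at a time."""
    memory = params / n_devices       # each device stores one model slice
    compute = batch_time              # partitions run strictly in sequence
    return memory, compute

def pipeline_parallel(params, batch_time, n_devices, micro_batches):
    """Model parallelism plus micro-batching to overlap partition work."""
    memory = params / n_devices
    stage_time = batch_time / (n_devices * micro_batches)  # one micro-batch, one stage
    # GPipe-style fill/drain bubble: (n_devices - 1) extra micro-batch slots
    compute = (micro_batches + n_devices - 1) * stage_time
    return memory, compute
```

Even this toy model reproduces the qualitative trade-offs the paper benchmarks: data parallelism trades memory (a full replica per device) for compute speedup, plain model parallelism saves memory but gains no speedup, and pipelining recovers most of the speedup while keeping the memory savings. The paper's point is that real implementations diverge from such clean formulas, which is why empirical benchmarking is needed.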
Pages: 31-39 (9 pages)