SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications

Cited by: 1
Authors
Nuriyev, Emin [1]
Manumachu, Ravi Reddy [1]
Aseeri, Samar [2]
Verma, Mahendra K. [3]
Lastovetsky, Alexey L. [1]
Affiliations
[1] Univ Coll Dublin, Sch Comp Sci, Dublin, Ireland
[2] King Abdullah Univ Sci & Technol KAUST, Extreme Comp Res Ctr ECRC, Thuwal, Saudi Arabia
[3] Indian Inst Technol Kanpur, Dept Phys, Kanpur, India
Keywords
Allreduce communication algorithm; MPI; Parallel deep learning; ResNet-50; ImageNet; High performance
DOI
10.1016/j.jpdc.2023.104767
Chinese Library Classification (CLC)
TP301 [Theory and methods]
Discipline code
081202
Abstract
Parallel and distributed deep learning (PDNN) has become an effective strategy to reduce the long training times of large-scale deep neural networks. Mainstream PDNN software packages based on the message-passing interface (MPI) and employing synchronous stochastic gradient descent rely crucially on the performance of the MPI allreduce collective communication routine. In this work, we propose a novel scalable universal allreduce meta-algorithm called SUARA. In general, SUARA consists of L serial steps, where L >= 2, executed by all MPI processes involved in the allreduce operation. At each step, SUARA partitions this set of processes into subsets, which execute optimally selected library allreduce algorithms to solve sub-allreduce problems on these subsets in parallel, so that the whole allreduce operation is accomplished after all L steps complete. We then design, theoretically study, and implement a two-step SUARA (L = 2) called SUARA2 on top of the Open MPI library. We prove that the theoretical asymptotic speedup of SUARA2 executed by P processes over the base Open MPI routine is O(√P). Our experiments on the Shaheen-II supercomputer employing 1024 nodes demonstrate over 2x speedup of SUARA2 over the native Open MPI allreduce routine, which translates into a 9% improvement in the training performance of the ResNet-50 DNN on ImageNet. (c) 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
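To make the two-step decomposition concrete, the sketch below is a minimal, hypothetical C/MPI illustration of a SUARA2-style allreduce, assuming P is a perfect square. The P processes are viewed as a √P x √P grid; step 1 runs independent sub-allreduces along the rows, and step 2 completes the reduction along the columns, after which every process holds the global result. The function name two_step_allreduce and the fixed row/column partition are illustrative assumptions, not the authors' implementation, which selects optimal library allreduce algorithms for each step inside Open MPI. Intuitively, if the dominant cost term of the base routine grows linearly with the number of participating processes, two steps over √P-sized subsets cost on the order of 2√P rather than P, which is the source of the O(√P) asymptotic speedup.

    /* Hypothetical sketch of a two-step allreduce in the spirit of SUARA2.
     * Assumes the number of processes is a perfect square; otherwise it
     * falls back to the library routine. Compile with an MPI C compiler
     * (e.g., mpicc) and link with -lm. */
    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>

    int two_step_allreduce(const double *sendbuf, double *recvbuf,
                           int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int q = (int)round(sqrt((double)size));
        if (q * q != size)  /* sketch assumes a perfect-square P */
            return MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE,
                                 MPI_SUM, comm);

        /* Partition the processes into subsets for the two steps:
         * rows of the q x q grid for step 1, columns for step 2. */
        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(comm, rank / q, rank, &row_comm);
        MPI_Comm_split(comm, rank % q, rank, &col_comm);

        double *tmp = malloc((size_t)count * sizeof(double));

        /* Step 1: q-process sub-allreduces within each row, in parallel.
         * Every process now holds its row's partial sum. */
        MPI_Allreduce(sendbuf, tmp, count, MPI_DOUBLE, MPI_SUM, row_comm);

        /* Step 2: sub-allreduces within each column combine one partial
         * sum per row, completing the global reduction on every process. */
        MPI_Allreduce(tmp, recvbuf, count, MPI_DOUBLE, MPI_SUM, col_comm);

        free(tmp);
        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
        return MPI_SUCCESS;
    }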
Pages: 15
Related papers (50 records)
  • [1] Qian, Hong; Sun, Bo; Guo, Yuanjun; Yang, Zhile; Ling, Jun; Feng, Wei. A parallel deep learning algorithm with applications in process monitoring and fault prediction. Computers & Electrical Engineering, 2022, 99.
  • [2] Cheng, Chi-Tung; Wang, Yirui; Chen, Huan-Wu; Hsiao, Po-Meng; Yeh, Chun-Nan; Hsieh, Chi-Hsun; Miao, Shun; Xiao, Jing; Liao, Chien-Hung; Lu, Le. A scalable physician-level deep learning algorithm detects universal trauma on pelvic radiographs. Nature Communications, 2021, 12 (1).
  • [3] Pumma, Sarunya; Si, Min; Feng, Wu-chun; Balaji, Pavan. Parallel I/O optimizations for scalable deep learning. 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), 2017: 720-729.
  • [4] Barillaro, Luca; Agapito, Giuseppe; Cannataro, Mario. Scalable deep learning for healthcare: methods and applications. 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB 2022), 2022.
  • [5] Liang, Zhen; Cai, Zhongxuan; Li, Minglong; Yang, Wenjing. Parallel Gym Gazebo: a scalable parallel robot deep reinforcement learning platform. 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI 2019), 2019: 206-213.
  • [6] Nguyen, Truong Thao; Wahib, Mohamed. An allreduce algorithm and network co-design for large-scale training of distributed deep learning. 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2021), 2021: 396-405.
  • [7] Hasheminezhad, Bita; Shirzad, Shahrzad; Wu, Nanmiao; Diehl, Patrick; Schulz, Hannes; Kaiser, Hartmut. Towards a scalable and distributed infrastructure for deep learning applications. Proceedings of the 2020 IEEE/ACM 5th Workshop on Deep Learning on Supercomputers (DLS 2020), 2020: 20-30.
  • [8] Chavan, Umesh; Kulkarni, Dinesh. Performance issues of parallel, scalable convolutional neural networks in deep learning. Computing, Communication and Signal Processing (ICCASP 2018), 2019, 810: 333-343.
  • [9] He, Yuanzhi; Sheng, Biao; Li, Yuan; Wang, Changxu; Chen, Xiang; Liu, Jinchao. Applications of deep learning in satellite communication: a survey. Space Information Networks (SINC 2023), 2024, 2057: 17-33.