SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications

Cited by: 1

Authors:

Nuriyev, Emin [1]

Manumachu, Ravi Reddy [1]

Aseeri, Samar [2]

Verma, Mahendra K. [3]

Lastovetsky, Alexey L. [1]

Affiliations:

[1] Univ Coll Dublin, Sch Comp Sci, Dublin, Ireland

[2] King Abdullah Univ Sci & Technol KAUST, Extreme Comp Res Ctr ECRC, Thuwal, Saudi Arabia

[3] Indian Inst Technol Kanpur, Dept Phys, Kanpur, India

Source:

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2024 / Vol. 183

Keywords:

Allreduce communication algorithm; MPI; Parallel deep learning; ResNet-50; ImageNet; High-performance

DOI:

10.1016/j.jpdc.2023.104767

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Parallel and distributed deep learning (PDNN) has become an effective strategy to reduce the long training times of large-scale deep neural networks. Mainstream PDNN software packages based on the message-passing interface (MPI) and employing synchronous stochastic gradient descent rely crucially on the performance of the MPI allreduce collective communication routine. In this work, we propose a novel scalable universal allreduce meta-algorithm called SUARA. In general, SUARA consists of L serial steps, where L >= 2, executed by all MPI processes involved in the allreduce operation. At each step, SUARA partitions this set of processes into subsets, which execute optimally selected library allreduce algorithms to solve sub-allreduce problems on these subsets in parallel, so that the whole allreduce operation is accomplished once all L steps have completed. We then design, theoretically study and implement a two-step SUARA (L = 2), called SUARA2, on top of the Open MPI library. We prove that the theoretical asymptotic speedup of SUARA2 executed by P processes over the base Open MPI routine is O(√P). Our experiments on the Shaheen-II supercomputer employing 1024 nodes demonstrate over 2x speedup of SUARA2 over the native Open MPI allreduce routine, which translates into a 9% performance improvement in training the ResNet-50 DNN on ImageNet. © 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
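
The two-step idea in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the SUARA2 implementation: the abstract does not say how the processes are partitioned, so the row-by-column grid used here, the function names `allreduce` and `two_step_allreduce`, and the use of summation as the reduction are all hypothetical choices made for illustration. Each list element stands for the local value held by one simulated process.

```python
from functools import reduce
from operator import add


def allreduce(values):
    """Stand-in for a library allreduce: every member receives the sum."""
    total = reduce(add, values)
    return [total] * len(values)


def two_step_allreduce(values, rows, cols):
    """Simulate a two-step allreduce over rows * cols processes.

    Assumption (not from the abstract): ranks are laid out in a
    rows x cols grid, rank = r * cols + c.
    """
    if rows * cols != len(values):
        raise ValueError("rows * cols must equal the number of processes")
    data = list(values)
    # Step 1: each grid row is one subset; the subsets are independent,
    # so on a real machine these sub-allreduces could run in parallel.
    for r in range(rows):
        members = [r * cols + c for c in range(cols)]
        for rank, v in zip(members, allreduce([data[m] for m in members])):
            data[rank] = v
    # Step 2: each grid column is one subset; it combines the row sums,
    # leaving the global sum on every process.
    for c in range(cols):
        members = [r * cols + c for r in range(rows)]
        for rank, v in zip(members, allreduce([data[m] for m in members])):
            data[rank] = v
    return data


local = list(range(1, 13))  # 12 simulated processes, one value each
result = two_step_allreduce(local, rows=3, cols=4)
```

After both steps every simulated process holds the sum of all twelve local values, which is the defining property of an allreduce. The simulation says nothing about communication cost or about which library algorithm would be selected for each subset.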

Pages: 15
