SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications

Authors
Nuriyev, Emin [1]
Manumachu, Ravi Reddy [1]
Aseeri, Samar [2]
Verma, Mahendra K. [3]
Lastovetsky, Alexey L. [1]
Affiliations
[1] Univ Coll Dublin, Sch Comp Sci, Dublin, Ireland
[2] King Abdullah Univ Sci & Technol KAUST, Extreme Comp Res Ctr ECRC, Thuwal, Saudi Arabia
[3] Indian Inst Technol Kanpur, Dept Phys, Kanpur, India
Keywords
Allreduce communication algorithm; MPI; Parallel deep learning; ResNet-50; ImageNet; High-performance
DOI
10.1016/j.jpdc.2023.104767
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Parallel and distributed deep learning (PDNN) has become an effective strategy to reduce the long training times of large-scale deep neural networks. Mainstream PDNN software packages based on the message-passing interface (MPI) and employing synchronous stochastic gradient descent rely crucially on the performance of the MPI allreduce collective communication routine. In this work, we propose a novel scalable universal allreduce meta-algorithm called SUARA. In general, SUARA consists of L serial steps, where L >= 2, executed by all MPI processes involved in the allreduce operation. At each step, SUARA partitions this set of processes into subsets, which execute optimally selected library allreduce algorithms to solve sub-allreduce problems on these subsets in parallel; the whole allreduce operation is accomplished once all L steps have completed. We then design, theoretically study and implement a two-step SUARA (L = 2) called SUARA2 on top of the Open MPI library. We prove that the theoretical asymptotic speedup of SUARA2 executed by P processes over the base Open MPI routine is O(√P). Our experiments on the Shaheen-II supercomputer employing 1024 nodes demonstrate over 2x speedup of SUARA2 over the native Open MPI allreduce routine, which translates into a 9% performance improvement in training the ResNet-50 DNN on ImageNet. © 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
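The idea of completing an allreduce in two serial steps over partitioned process subsets can be illustrated with a small simulation. The sketch below is not the authors' implementation and does not use MPI: it assumes, purely for illustration, that the P processes are arranged as a `rows` x `cols` grid, that step 1 runs a sum-allreduce inside each row, and that step 2 runs a sum-allreduce inside each column. The function name `two_step_allreduce` and the grid partitioning are hypothetical; the abstract does not state how SUARA2 chooses its subsets or which library algorithm each subset runs.

```python
from typing import List


def two_step_allreduce(values: List[int], rows: int, cols: int) -> List[int]:
    """Simulate a two-step sum-allreduce over rows * cols processes.

    values[i] is the local contribution of process i. The grid layout
    is an illustrative assumption, not the partitioning from the paper.
    """
    if rows * cols != len(values):
        raise ValueError("rows * cols must equal the number of processes")

    # Step 1: partition the processes into `rows` subsets of consecutive
    # ranks; each subset performs its own sub-allreduce (here, a sum).
    step1 = list(values)
    for r in range(rows):
        group = range(r * cols, (r + 1) * cols)
        partial = sum(values[i] for i in group)
        for i in group:
            step1[i] = partial

    # Step 2: re-partition into `cols` subsets (ranks with the same
    # column index); each subset sums the step-1 results it holds.
    step2 = list(step1)
    for c in range(cols):
        group = range(c, rows * cols, cols)
        total = sum(step1[i] for i in group)
        for i in group:
            step2[i] = total

    return step2


if __name__ == "__main__":
    contributions = list(range(1, 13))  # 12 simulated processes
    print(two_step_allreduce(contributions, rows=3, cols=4))
```

After step 2 every simulated process holds the sum of all contributions, because each column subset contains exactly one member from every row subset. In a real MPI program the subsets would typically be built with communicator splitting, and the per-subset reduction would be a library allreduce call rather than a Python loop.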
Pages: 15