Performance Characterization of MPI_Allreduce in Cloud Data Center Networks

Cited by: 0
Authors
Musleh, Malek [1]
Alemania, Allister [1]
Penaranda, Roberto [1]
Segura, Pedro Yebenes [1]
Affiliations
[1] Intel Corp, Santa Clara, CA 95051 USA
Keywords
Data Center; Cloud Computing; Deep Learning; AI Training; HPC; Collectives; Congestion Control; AllReduce; Rabenseifner; Ring; K-ary Tree; Recursive Doubling
DOI
10.1109/MASCOTS53633.2021.9614297
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Advancements in hardware architecture and system design have enabled the transformational pivot towards cloud computing. The rising cost of on-premise, vertical scaling, and maintenance, combined with the rise of workload heterogeneity, has fueled this paradigm shift. Furthermore, increasingly demanding use cases for storage and High-Performance Computing (HPC), as well as the emergence of new workloads such as machine learning and big data, have motivated research into network traffic analysis, resource disaggregation, job scheduling, and orchestration in a bid to reduce total cost while maintaining high performance. A prime focus of this research is the communication performance of collectives, which comprise a significant portion of the communication in many of the aforementioned workloads. In this paper, we characterize the performance of MPI_Allreduce, which is used extensively in HPC and Deep-Learning (DL) training workloads, in cloud environments. We present several key insights, including that the Ring algorithm performs better than the Rabenseifner algorithm under conditions of high congestion and packet drops. We further provide a detailed performance analysis under different assumptions and offer recommendations on how to address issues that can manifest.
Pages: 57-64
Number of pages: 8