ADAPT: An Event-Based Adaptive Collective Communication Framework

Cited by: 15
Authors
Luo, Xi [1]
Wu, Wei [2]
Bosilca, George [1]
Patinyasakdikul, Thananon [1]
Wang, Linnan [3]
Dongarra, Jack [1, 4]
Affiliations
[1] Univ Tennessee, Knoxville, TN 37996 USA
[2] Los Alamos Natl Lab, Los Alamos, NM USA
[3] Brown Univ, Providence, RI 02912 USA
[4] Oak Ridge Natl Lab, Oak Ridge, TN USA
Keywords
MPI; event-driven; system noise; collective operations; GPU; heterogeneous system; performance
DOI
10.1145/3208040.3208054
CLC number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The increase in scale and heterogeneity of high-performance computing (HPC) systems makes the performance of Message Passing Interface (MPI) collective communications susceptible to noise and requires collectives to adapt to a complex mix of hardware capabilities. The designs of state-of-the-art MPI collectives rely heavily on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdowns. This design philosophy must therefore be reconsidered in order to run efficiently and robustly on large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI that uses event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries between different hardware topologies while maximizing the speed of data movement. We evaluate our framework with two popular collective operations, broadcast and reduce, on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and strong resistance to noise compared with other state-of-the-art MPI libraries. In particular, we demonstrate at least 1.3x and 1.5x speedups for CPU data and 2x and 10x speedups for GPU data using ADAPT event-based broadcast and reduce operations.
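The idea of relaxing synchronizations while keeping per-child data dependencies can be illustrated with a toy timing model. The sketch below is not ADAPT's implementation and does not use Open MPI; the function names, the segment-cost model, and the assumption that sends to different children proceed in parallel are all hypothetical, chosen only to show how one noisy child affects the others under each scheme.

```python
import heapq


def synchronized_finish_times(seg_costs, num_segments):
    """Toy model: the root waits until every child has received segment k
    before sending segment k + 1, so each step costs the slowest child's time.

    seg_costs[i] is the (assumed) time to deliver one segment to child i.
    """
    if not seg_costs:
        return []
    step = max(seg_costs)
    return [step * num_segments for _ in seg_costs]


def event_driven_finish_times(seg_costs, num_segments):
    """Toy model: the completion of a send to child i triggers only the next
    send to child i, so a slow child does not hold back the others.

    Implemented as a small discrete-event simulation over a priority queue.
    """
    finish = [0] * len(seg_costs)
    if num_segments <= 0:
        return finish
    # Each event is (completion_time, child, segment_index).
    events = [(cost, child, 0) for child, cost in enumerate(seg_costs)]
    heapq.heapify(events)
    while events:
        now, child, seg = heapq.heappop(events)
        finish[child] = now
        if seg + 1 < num_segments:
            # Only this child's data dependency is respected: segment
            # seg + 1 follows segment seg on the same link.
            heapq.heappush(events, (now + seg_costs[child], child, seg + 1))
    return finish
```

With per-segment costs of `[1, 1, 5]` (the third child being delayed by noise) and 4 segments, the synchronized model finishes every child at time 20, while the event-driven model finishes the two unaffected children at time 4 and only the noisy child at time 20.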
Pages: 118-130
Number of pages: 13