High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters

Citations: 0

Authors
Li, Mingzhe [1 ]
Hamidouche, Khaled [1 ]
Lu, Xiaoyi [1 ]
Lin, Jian [1 ]
Panda, Dhabaleswar K. [1 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
DOI
10.1007/978-3-662-48096-0_48
Chinese Library Classification: TP [Automation technology; computer technology]
Discipline code: 0812
Abstract
Intel Many Integrated Core (MIC) architectures play a key role in modern supercomputing systems thanks to their high performance and low power consumption, making them an attractive choice for accelerating HPC applications. MPI-3 RMA is an important part of the MPI-3 standard: its one-sided semantics reduce synchronization overhead and allow communication to overlap with computation, which makes the RMA model a prime target for developing scalable applications with irregular communication patterns. However, efficient runtime support for MPI-3 RMA with simultaneous use of both processors and co-processors has not been well explored. In this paper, we propose high-performance and scalable runtime-level designs for MPI-3 RMA involving both the host and Xeon Phi processors, and we incorporate these designs into the popular MVAPICH2 MPI library. To the best of our knowledge, this is the first work to propose high-performance runtime designs for MPI-3 RMA on Intel Xeon Phi clusters. Experimental evaluations show a 5X reduction in uni-directional MPI_Put and MPI_Get latency for 4 MB messages between two Xeon Phis, compared to an out-of-the-box version of MVAPICH2. Application evaluations in symmetric mode show performance improvements of 25% at a scale of 1,024 processes.
Pages: 625-637 (13 pages)