Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Citations: 0
Authors
Dathathri, Roshan [1 ]
Reddy, Chandan [1 ]
Ramashekar, Thejas [1 ]
Bondhugula, Uday [1 ]
Affiliations
[1] Indian Inst Sci, Dept Comp Sci & Automat, Bangalore 560012, Karnataka, India
Keywords
communication optimization; data movement; polyhedral model; distributed memory; heterogeneous architectures;
DOI
Not available
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between the memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster, is an example of such a system. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement; the generated data movement code minimizes the volume of communication for that particular configuration. We use a combination of powerful static analyses, relying on the polyhedral compiler framework, and the lightweight runtime routines they generate to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11X to 83X over the state of the art, translating into a mean execution-time speedup of 1.53X. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4X to 63.5X over the state of the art, resulting in a mean speedup of 1.55X. In addition, our scheme yields a mean speedup of 2.19X over hand-optimized UPC codes.
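The central idea of dependence-based data movement can be illustrated with a toy sketch (hypothetical code, not the paper's actual partitioning scheme): for a 1-D stencil whose iterations are block-partitioned across devices, intersecting the set of elements each device's iterations read with the sets owned by other devices yields only the boundary (halo) elements that actually need to be communicated, rather than entire arrays.

```python
# Toy sketch: dependence-based communication volume for a 1-D stencil.
# Iterations 0..n-1 are block-partitioned over num_devices; iteration i
# reads elements i-radius .. i+radius. Only elements a device reads but
# does not own must be received from another device.

def comm_volume(n, num_devices, radius=1):
    """Total elements communicated per step under a block partition."""
    block = n // num_devices
    volume = 0
    for d in range(num_devices):
        lo = d * block
        hi = (d + 1) * block if d < num_devices - 1 else n
        # Data read by this device's iterations (clipped to array bounds).
        reads = set(range(max(0, lo - radius), min(n, hi + radius)))
        # Data owned by this device under the block partition.
        owned = set(range(lo, hi))
        # Elements read but owned elsewhere: the required communication.
        volume += len(reads - owned)
    return volume

# Each of the num_devices-1 internal boundaries needs 2*radius elements
# exchanged in total, independent of the array size n.
print(comm_volume(1024, 4))  # 6 elements, versus shipping whole blocks
```

A naive scheme that transfers each device's entire block to every consumer would move O(n) elements per step; the dependence-based set difference above moves O(radius * num_devices), which is the kind of asymptotic reduction in communication volume the paper targets.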
Pages: 375-386 (12 pages)