CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs

Cited by: 0
Authors
DeHao Chen
WenGuang Chen
WeiMin Zheng
Affiliation
[1] Tsinghua University, Department of Computer Science and Technology
Keywords
CUDA; parallelization; data access pattern; multi-GPU
DOI
Not available
Abstract
With general-purpose computation on GPUs becoming prevalent, shared memory programming models have been proposed to ease the pain of GPU programming. However, as workloads grow more intensive, it is desirable to port GPU programs to more scalable distributed memory environments, such as multi-GPU systems. To achieve this, programs need to be rewritten with mixed programming models (e.g., CUDA and message passing). Programmers must work carefully not only on workload distribution but also on scheduling mechanisms to ensure efficient execution. In this paper, we study the possibility of automating parallelization for multi-GPU systems. Starting from a GPU program written in a shared memory model, our framework analyzes the access patterns of arrays in kernel functions to derive data partition schemes. To acquire the access patterns, we propose a three-tier approach: static analysis, profile-based analysis, and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, which means that zero effort is needed to port such an existing application to a distributed memory environment. We use our framework to parallelize several applications and show that, for certain kinds of applications, CUDA-Zero can achieve efficient parallelization in a multi-GPU environment.
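The idea of deriving a data partition from an array access pattern can be illustrated with a small sketch. This is not the CUDA-Zero implementation: it assumes a hypothetical profile trace of (thread id, array index) pairs for one array, assumes the pattern is a single affine function of the thread id, and assumes threads are split into contiguous chunks across GPUs. The names `fit_affine` and `partition` are made up for illustration.

```python
from typing import List, Optional, Tuple


def fit_affine(trace: List[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Return (a, b) if every sample satisfies index == a * tid + b, else None.

    `trace` is a hypothetical profile: (thread id, array index) pairs.
    """
    samples = sorted(set(trace))
    if len(samples) < 2:
        return None
    (t0, i0), (t1, i1) = samples[0], samples[1]
    # Two samples from the same thread, or a non-integer slope, rule out
    # a single affine pattern in the thread id.
    if t1 == t0 or (i1 - i0) % (t1 - t0) != 0:
        return None
    a = (i1 - i0) // (t1 - t0)
    b = i0 - a * t0
    if all(i == a * t + b for t, i in samples):
        return a, b
    return None


def partition(num_threads: int, num_gpus: int, a: int, b: int) -> List[Tuple[int, int]]:
    """Split threads into contiguous chunks, one per GPU, and return the
    half-open index range [lo, hi) of the array that each GPU touches."""
    chunk = (num_threads + num_gpus - 1) // num_gpus
    ranges = []
    for g in range(num_gpus):
        lo, hi = g * chunk, min((g + 1) * chunk, num_threads)
        if lo >= hi:
            ranges.append((0, 0))  # this GPU gets no threads
            continue
        first, last = a * lo + b, a * (hi - 1) + b
        ranges.append((min(first, last), max(first, last) + 1))
    return ranges


if __name__ == "__main__":
    # Hypothetical trace: thread t reads element 2 * t + 1.
    trace = [(t, 2 * t + 1) for t in range(8)]
    pattern = fit_affine(trace)
    if pattern is not None:
        print(partition(8, 2, *pattern))
```

When `fit_affine` returns `None`, the pattern is irregular from this sketch's point of view; in the three-tier approach described above, that is the kind of case that would fall through to user annotation.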
Pages: 663-676
Page count: 13