Multi-GPU systems and Unified Virtual Memory for scientific applications: The case of the NAS multi-zone parallel benchmarks

Cited by: 4

Authors:

Gonzalez, Marc [1]

Morancho, Enric [1]

Affiliations:

[1] Univ Politecn Catalunya BarcelonaTECH, Dept Comp Architecture, Barcelona, Spain

Source:

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2021 / Vol. 158

Keywords:

Multi-GPU; Unified Virtual Memory; Single address space; NAS parallel benchmarks;

DOI:

10.1016/j.jpdc.2021.08.001

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

GPU-based computing systems have become a widely accepted solution for the high-performance-computing (HPC) domain. GPUs have shown highly competitive performance-per-watt ratios and can exploit an astonishing level of parallelism. However, exploiting the peak performance of such devices is a challenge, mainly due to the combination of two essential aspects of multi-GPU execution: memory allocation and work distribution. Memory allocation determines the data mapping to GPUs, and therefore conditions all work distribution schemes and communication phases in the application. Unified Virtual Memory simplifies the coding of memory allocations, but its effects on performance depend on how data is used by the devices and how the devices' driver orchestrates the data transfers across the system. In this paper we present a multi-GPU and Unified Virtual Memory (UM) implementation of the NAS Multi-Zone Parallel Benchmarks, which alternate communication and computation phases and thus offer opportunities to overlap these phases. We analyse the programmability and performance effects of introducing UM support. Our experience shows that the programming effort for introducing UM is similar to that of having a memory allocation per GPU. On an evaluation environment composed of 2 x IBM Power9 8335-GTH and 4 x GPU NVIDIA V100 (Volta), our UM-based parallelization outperforms the manual memory allocation versions by 1.10x to 1.85x. However, these improvements are highly sensitive to the information forwarded to the devices' driver describing the most convenient location for specific memory regions. We analyse these improvements in terms of the relationship between the computational and communication phases of the applications. © 2021 The Author(s). Published by Elsevier Inc.


Pages: 138-150

Page count: 13

22 entries in total

[1] Multi-GPU Parallelization of the NAS Multi-Zone Parallel Benchmarks
Gonzalez, Marc
Morancho, Enric
[J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (01) : 229 - 241
[2] Performance characteristics of the multi-zone NAS parallel benchmarks
Jin, HQ
Van der Wijngaart, RF
[J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2006, 66 (05) : 674 - 685
[3] Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory
Choi, Sangjin
Kim, Taeksoo
Jeong, Jinwoo
Ausavarungnirun, Rachata
Jeon, Myeongjae
Kwon, Youngjin
Ahn, Jeongseob
[J]. PROCEEDINGS OF THE 2022 USENIX ANNUAL TECHNICAL CONFERENCE, 2022, : 625 - 638
[4] Benchmarking multi-GPU applications on modern multi-GPU integrated systems
Bernaschi, Massimo
Agostini, Elena
Rossetti, Davide
[J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (14):
[5] Data Parallel Skeletons for GPU Clusters and Multi-GPU Systems
Ernsting, Steffen
Kuchen, Herbert
[J]. APPLICATIONS, TOOLS AND TECHNIQUES ON THE ROAD TO EXASCALE COMPUTING, 2012, 22 : 509 - 518
[6] Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems
Acer, Seher
Boman, Erik G.
Glusa, Christian A.
Rajamanickam, Sivasankaran
[J]. PARALLEL COMPUTING, 2021, 106
[7] Performance Analysis of Parallel FFT on Large Multi-GPU Systems
Ayala, Alan
Tomov, Stan
Stoyanov, Miroslav
Haidar, Azzam
Dongarra, Jack
[J]. 2022 IEEE 36TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW 2022), 2022, : 372 - 381
[8] Parallel Algorithm for Landform Attributes Representation on Multicore and Multi-GPU Systems
Boratto, Murilo
Alonso, Pedro
Ramiro, Carla
Barreto, Marcos
Coelho, Leandro
[J]. COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2012, PT I, 2012, 7333 : 29 - 43
[9] GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems
Ino, Fumihiko
Nakagawa, Shinta
Hagihara, Kenichi
[J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2013, E96D (12) : 2604 - 2616
[10] Scalable Framework for Mapping Streaming Applications onto Multi-GPU Systems
Huynh, Huynh Phung
Hagiescu, Andrei
Wong, Weng-Fai
Goh, Rick Siow Mong
[J]. ACM SIGPLAN NOTICES, 2012, 47 (08) : 1 - 10
