Improving the Scalability of GPU Synchronization Primitives

Cited by: 2

Authors:

Dalmia, Preyesh [1]

Mahapatra, Rohan [2]

Intan, Jeremy [3]

Negrut, Dan [1]

Sinclair, Matthew D. [1,4]

Affiliations:

[1] Univ Wisconsin, Madison, WI 53706 USA

[2] Univ Calif San Diego, La Jolla, CA 92093 USA

[3] Univ Illinois, Champaign, IL 61820 USA

[4] AMD Res, Santa Clara, CA 95054 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2023, Vol. 34, No. 1

Keywords:

Synchronization; Graphics processing units; Instruction sets; Scalability; Kernel; Coherence; Message systems; Barrier synchronization; Algorithms

DOI:

10.1109/TPDS.2022.3218508

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

General-purpose GPU applications increasingly use synchronization to enforce ordering between many threads accessing shared data. Accordingly, there has recently been a push to establish a common set of GPU synchronization primitives. However, the expressiveness of existing GPU synchronization primitives is limited. In particular, the expensive GPU atomics often used to implement fine-grained synchronization make it challenging to implement efficient algorithms. Consequently, as GPU algorithms scale to millions or billions of threads, existing GPU synchronization primitives either scale poorly or suffer from livelock or deadlock because of heavy contention between threads accessing shared synchronization objects. We seek to overcome these inefficiencies by designing more efficient, scalable GPU barriers and semaphores. In particular, we show how multi-level sense-reversing barriers and priority mechanisms for semaphores can be designed with the GPU's unique processing model in mind to improve the performance and scalability of GPU synchronization primitives. Our results show that, at medium and high contention levels, the proposed designs significantly improve performance compared to state-of-the-art solutions such as CUDA Cooperative Groups and optimized CPU-style synchronization algorithms, scale to an order of magnitude more threads, and avoid livelock in these situations, unlike prior open-source algorithms. Overall, across three modern GPUs, the proposed barrier algorithm improves performance by an average of 33% over a GPU tree barrier algorithm and by an average of 34% over CUDA Cooperative Groups for five full-sized benchmarks at high contention levels; the new semaphore algorithm improves performance by an average of 83% compared to prior GPU semaphores.
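
For readers unfamiliar with the term, the basic idea behind a sense-reversing barrier can be sketched in a few lines. The sketch below is a single-level, centralized barrier for ordinary CPU threads, written in Python purely for illustration; it is not the multi-level GPU design proposed in the paper, and the names `SenseReversingBarrier` and `run_phases` are hypothetical. A lock stands in for the atomic decrement a GPU implementation would typically use.

```python
import threading
import time


class SenseReversingBarrier:
    """Centralized sense-reversing barrier for a fixed number of threads."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = num_threads        # arrivals still expected in this phase
        self.sense = False              # shared flag, flipped once per phase
        self.lock = threading.Lock()    # assumption: stands in for an atomic
        self.local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense on entry to the barrier.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
            if last:
                # The last arriver resets the counter, then releases the rest
                # by publishing the new sense.
                self.count = self.num_threads
                self.sense = my_sense
        if not last:
            # Spin until the shared sense matches this thread's private sense.
            while self.sense != my_sense:
                time.sleep(0)           # yield to other threads while spinning


def run_phases(num_threads=4, num_phases=50):
    """Return True if every thread saw all peers in the same phase."""
    barrier = SenseReversingBarrier(num_threads)
    progress = [0] * num_threads
    ok = [True] * num_threads

    def worker(tid):
        for phase in range(1, num_phases + 1):
            progress[tid] = phase
            barrier.wait()
            # Past the barrier, every thread must have recorded this phase.
            if any(p != phase for p in progress):
                ok[tid] = False
            barrier.wait()  # hold fast threads until all checks are done

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(ok)
```

Calling `run_phases(4, 50)` exercises the barrier over repeated phases. Because the shared flag alternates between phases, the barrier can be reused immediately without a separate reset step, which is the property that makes the sense-reversing scheme attractive as a building block.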


Pages: 275-290

Page count: 16
