Improving the Scalability of GPU Synchronization Primitives

Times cited: 2
Authors
Dalmia, Preyesh [1]
Mahapatra, Rohan [2]
Intan, Jeremy [3]
Negrut, Dan [1]
Sinclair, Matthew D. [1, 4]
Affiliations
[1] Univ Wisconsin, Madison, WI 53706 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Univ Illinois, Champaign, IL 61820 USA
[4] AMD Res, Santa Clara, CA 95054 USA
Keywords
Synchronization; Graphics processing units; Instruction sets; Scalability; Kernel; Coherence; Message systems; Barrier synchronization; Algorithms
DOI
10.1109/TPDS.2022.3218508
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
General-purpose GPU applications increasingly use synchronization to enforce ordering between many threads accessing shared data. Accordingly, there has been a recent push to establish a common set of GPU synchronization primitives. However, the expressiveness of existing GPU synchronization primitives is limited. In particular, the expensive GPU atomics often used to implement fine-grained synchronization make it challenging to implement efficient algorithms. Consequently, as GPU algorithms scale to millions or billions of threads, existing GPU synchronization primitives either scale poorly or suffer from livelock or deadlock because of heavy contention between threads accessing shared synchronization objects. We seek to overcome these inefficiencies by designing more efficient, scalable GPU barriers and semaphores. In particular, we show how multi-level sense-reversing barriers and priority mechanisms for semaphores can be designed with the GPU's unique processing model in mind to improve the performance and scalability of GPU synchronization primitives. Our results show that the proposed designs significantly improve performance compared to state-of-the-art solutions such as CUDA Cooperative Groups and optimized CPU-style synchronization algorithms at medium and high contention levels, scale to an order of magnitude more threads, and avoid livelock in these situations, unlike prior open-source algorithms. Overall, across three modern GPUs, the proposed barrier algorithm improves performance by an average of 33% over a GPU tree barrier algorithm and by an average of 34% over CUDA Cooperative Groups for five full-sized benchmarks at high contention levels; the new semaphore algorithm improves performance by an average of 83% compared to prior GPU semaphores.
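For readers unfamiliar with the term, the sense-reversing idea named in the abstract can be illustrated with a minimal single-level sketch. This is a CPU-side C++ illustration under stated assumptions, not the paper's multi-level GPU design: the names `SenseBarrier`, `arrive_and_wait`, and `count_barrier_violations` are hypothetical, and host threads stand in for GPU threads. Each thread flips a private sense flag on arrival; the last arriver resets the shared counter and publishes the new sense, which releases the waiting threads and lets the same barrier object be reused for the next phase.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical single-level (centralized) sense-reversing barrier.
class SenseBarrier {
public:
    explicit SenseBarrier(int n) : total_(n), remaining_(n), sense_(false) {
        assert(n > 0);
    }

    // local_sense is per-thread state; every thread starts it at false.
    void arrive_and_wait(bool& local_sense) {
        local_sense = !local_sense;  // sense this thread waits for in this phase
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arriver: reset the counter first, then publish the new
            // sense so that released threads observe the reset counter.
            remaining_.store(total_, std::memory_order_relaxed);
            sense_.store(local_sense, std::memory_order_release);
        } else {
            // Everyone else spins until the shared sense matches.
            while (sense_.load(std::memory_order_acquire) != local_sense) {
                std::this_thread::yield();
            }
        }
    }

private:
    const int total_;
    std::atomic<int> remaining_;
    std::atomic<bool> sense_;
};

// Hypothetical checker: counts how often a thread left the barrier for a
// phase before all threads had arrived at that phase.
int count_barrier_violations(int num_threads, int phases) {
    SenseBarrier barrier(num_threads);
    std::vector<std::atomic<int>> arrived(phases);
    for (auto& a : arrived) a.store(0);
    std::atomic<int> violations{0};

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            bool local_sense = false;
            for (int p = 0; p < phases; ++p) {
                arrived[p].fetch_add(1, std::memory_order_relaxed);
                barrier.arrive_and_wait(local_sense);
                if (arrived[p].load(std::memory_order_relaxed) != num_threads) {
                    violations.fetch_add(1);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return violations.load();
}
```

Calling `count_barrier_violations(4, 50)` exercises the barrier across 50 reused phases. In this sketch every thread contends on one shared counter; a multi-level design, as the abstract describes, would presumably spread that contention across several smaller groups, but the details here should not be read as the authors' algorithm.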
Pages: 275-290
Page count: 16