Improving the Scalability of GPU Synchronization Primitives

Authors
Dalmia, Preyesh [1]
Mahapatra, Rohan [2]
Intan, Jeremy [3]
Negrut, Dan [1]
Sinclair, Matthew D. [1, 4]
Affiliations
[1] Univ Wisconsin, Madison, WI 53706 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Univ Illinois, Champaign, IL 61820 USA
[4] AMD Res, Santa Clara, CA 95054 USA
Keywords
Synchronization; Graphics processing units; Instruction sets; Scalability; Kernel; Coherence; Message systems; Barrier synchronization; Algorithms
DOI
10.1109/TPDS.2022.3218508
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
General-purpose GPU applications increasingly use synchronization to enforce ordering between many threads accessing shared data. Accordingly, there has recently been a push to establish a common set of GPU synchronization primitives. However, the expressiveness of existing GPU synchronization primitives is limited. In particular, the expensive GPU atomics often used to implement fine-grained synchronization make it challenging to implement efficient algorithms. Consequently, as GPU algorithms scale to millions or billions of threads, existing GPU synchronization primitives either scale poorly or suffer from livelock or deadlock because of heavy contention between threads accessing shared synchronization objects. We seek to overcome these inefficiencies by designing more efficient, scalable GPU barriers and semaphores. In particular, we show how multi-level sense-reversing barriers and priority mechanisms for semaphores can be designed with the GPU's unique processing model in mind to improve the performance and scalability of GPU synchronization primitives. Our results show that the proposed designs significantly improve performance compared to state-of-the-art solutions like CUDA Cooperative Groups and optimized CPU-style synchronization algorithms at medium and high contention levels, scale to an order of magnitude more threads, and avoid livelock in these situations, unlike prior open-source algorithms. Overall, across three modern GPUs, the proposed barrier algorithm improves performance by an average of 33% over a GPU tree barrier algorithm and by an average of 34% over CUDA Cooperative Groups for five full-sized benchmarks at high contention levels; the new semaphore algorithm improves performance by an average of 83% compared to prior GPU semaphores.
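The abstract names sense-reversing barriers without describing them. As a rough illustration of the general idea only, the following is a minimal single-level, CPU-style sketch in C++. It is not the paper's multi-level GPU design, and the names `SenseBarrier` and `count_violations` are hypothetical, introduced here for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical single-level sense-reversing barrier (illustrative sketch,
// not the algorithm from the paper).
class SenseBarrier {
public:
    explicit SenseBarrier(int n) : total_(n), remaining_(n), sense_(false) {
        assert(n > 0);
    }

    // Each thread owns a local_sense flag that starts as false.
    void arrive_and_wait(bool& local_sense) {
        local_sense = !local_sense;  // flip the sense for this phase
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arriver: reset the counter first, then release the waiters.
            remaining_.store(total_, std::memory_order_relaxed);
            sense_.store(local_sense, std::memory_order_release);
        } else {
            // Spin until the shared sense matches this phase's local sense.
            while (sense_.load(std::memory_order_acquire) != local_sense) {
                std::this_thread::yield();
            }
        }
    }

private:
    const int total_;
    std::atomic<int> remaining_;
    std::atomic<bool> sense_;
};

// Runs `phases` rounds on `num_threads` threads and returns how many times a
// thread observed a counter value inconsistent with a correct barrier.
int count_violations(int num_threads, int phases) {
    SenseBarrier barrier(num_threads);
    std::atomic<int> counter{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            bool local_sense = false;
            for (int p = 1; p <= phases; ++p) {
                counter.fetch_add(1, std::memory_order_relaxed);
                barrier.arrive_and_wait(local_sense);  // all increments done
                if (counter.load(std::memory_order_relaxed) != p * num_threads) {
                    violations.fetch_add(1);
                }
                barrier.arrive_and_wait(local_sense);  // all checks done
            }
        });
    }
    for (auto& w : workers) w.join();
    return violations.load();
}
```

Because the shared sense flag alternates between phases, the same barrier object can be reused immediately without a separate reset step. In this sketch every thread decrements and spins on the same shared locations, which is the kind of contention the abstract says scales poorly; a multi-level design would presumably limit how many threads touch any one location.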
Pages: 275-290
Number of pages: 16