Improving the Scalability of GPU Synchronization Primitives

Cited by: 2
Authors
Dalmia, Preyesh [1]
Mahapatra, Rohan [2]
Intan, Jeremy [3]
Negrut, Dan [1]
Sinclair, Matthew D. [1, 4]
Affiliations
[1] Univ Wisconsin, Madison, WI 53706 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Univ Illinois, Champaign, IL 61820 USA
[4] AMD Res, Santa Clara, CA 95054 USA
Keywords
Synchronization; Graphics processing units; Instruction sets; Scalability; Kernel; Coherence; Message systems; Barrier synchronization; Algorithms
DOI
10.1109/TPDS.2022.3218508
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
General-purpose GPU applications increasingly use synchronization to enforce ordering between many threads accessing shared data. Accordingly, there has been a recent push to establish a common set of GPU synchronization primitives. However, the expressiveness of existing GPU synchronization primitives is limited. In particular, the expensive GPU atomics often used to implement fine-grained synchronization make it challenging to implement efficient algorithms. Consequently, as GPU algorithms scale to millions or billions of threads, existing GPU synchronization primitives either scale poorly or suffer from livelock or deadlock because of heavy contention between threads accessing shared synchronization objects. We seek to overcome these inefficiencies by designing more efficient, scalable GPU barriers and semaphores. In particular, we show how multi-level sense-reversing barriers and priority mechanisms for semaphores can be designed with the GPU's unique processing model in mind to improve the performance and scalability of GPU synchronization primitives. Our results show that the proposed designs significantly improve performance compared to state-of-the-art solutions like CUDA Cooperative Groups and optimized CPU-style synchronization algorithms at medium and high contention levels, scale to an order of magnitude more threads, and avoid livelock in these situations, unlike prior open-source algorithms. Overall, across three modern GPUs, the proposed barrier algorithm improves performance by an average of 33% over a GPU tree barrier algorithm and by an average of 34% over CUDA Cooperative Groups for five full-sized benchmarks at high contention levels; the new semaphore algorithm improves performance by an average of 83% compared to prior GPU semaphores.
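For readers unfamiliar with the term, the following is a minimal sketch of a generic, single-level sense-reversing barrier of the textbook CPU style. It is not the multi-level GPU design proposed in the paper. Python threads stand in for GPU threads, a lock stands in for an atomic decrement, and the names `SenseReversingBarrier` and `run_phases` are hypothetical, chosen only for this illustration.

```python
import threading
import time


class SenseReversingBarrier:
    """Illustrative centralized sense-reversing barrier (not the paper's design)."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = num_threads        # arrivals still expected in this phase
        self.sense = False              # shared flag, flipped once per phase
        self.lock = threading.Lock()    # assumption: stands in for an atomic op
        self.local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense on every barrier entry.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
            if last:
                # The last arrival resets the counter for the next phase,
                # then publishes the new sense to release the waiters.
                self.count = self.num_threads
                self.sense = my_sense
        if not last:
            # Everyone else spins until the shared sense matches their own.
            while self.sense != my_sense:
                time.sleep(0)  # yield so other threads can make progress


def run_phases(num_threads=4, num_phases=3):
    """Return True if no thread ever passed a barrier before all had arrived."""
    barrier = SenseReversingBarrier(num_threads)
    arrivals = [0] * num_phases
    arrivals_lock = threading.Lock()
    ok = [True]

    def worker():
        for phase in range(num_phases):
            with arrivals_lock:
                arrivals[phase] += 1
            barrier.wait()
            # Past the barrier, every thread must have recorded its arrival.
            if arrivals[phase] != num_threads:
                ok[0] = False

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ok[0]
```

Because the shared flag alternates between phases, the same barrier object can be reused without a separate reset step. Every thread still updates one shared counter, which is the kind of contention on a shared synchronization object that the abstract identifies as a scalability problem.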
Pages: 275-290
Page count: 16