Unified On-chip Memory Allocation for SIMT Architecture

Cited by: 12
Authors
Hayes, Ari B. [1]
Zhang, Eddy Z. [1]
Affiliations
[1] Rutgers State Univ, Dept Comp Sci, Piscataway, NJ 08554 USA
Keywords
GPU; Register Allocation; Shared Memory Allocation; Compiler Optimization; Concurrency;
DOI
10.1145/2597652.2597685
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The popularity of general-purpose graphics processing units (GPUs) is largely attributed to the tremendous concurrency enabled by their underlying single instruction, multiple thread (SIMT) architecture. It keeps the context of a significant number of threads in registers to enable fast "context switches" when the processor is stalled due to execution dependences, memory requests, and so on. The SIMT architecture has a large register file evenly partitioned among all concurrent threads. Per-thread register usage determines the number of concurrent threads, which strongly affects whole-program performance. Existing register allocation techniques, extensively studied over the past several decades, are oblivious to the register contention caused by the concurrent execution of many threads. They are prone to making optimization decisions that benefit a single thread but degrade the performance of the whole application. Is it possible for compilers to make register allocation decisions that maximize whole GPU application performance? We tackle this important question from two different aspects in this paper. First, we propose a unified on-chip memory allocation framework that uses scratch-pad memory to help (1) alleviate single-thread register pressure and (2) increase whole-application throughput. Second, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we discovered that it is possible to automatically determine, at compile time, an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure. We are able to achieve up to 1.70 times speedup over the baseline of the traditional register allocation scheme that maximizes single-thread performance.
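The trade-off the abstract describes, where per-thread register usage caps the number of concurrent threads and scratch-pad memory can relieve that cap, can be illustrated with a small calculation. The sketch below is not the paper's framework or its characterization model. It is a simplified toy model, and every name and hardware figure in it (65536 registers, 48 KB of scratch-pad memory, a 2048-thread limit, 4 bytes per relocated value, the functions `max_concurrent_threads` and `best_partition`) is an assumption chosen for illustration.

```python
# Toy model (assumption, not the paper's method): how many threads can be
# resident on one multiprocessor, given per-thread register and scratch-pad use.

def max_concurrent_threads(regs_per_thread, smem_per_thread_bytes,
                           regfile_regs=65536, smem_bytes=49152,
                           hw_thread_limit=2048, warp_size=32):
    """Upper bound on resident threads, rounded down to whole warps.

    All default hardware figures are illustrative assumptions.
    """
    by_regs = regfile_regs // regs_per_thread
    if smem_per_thread_bytes == 0:
        by_smem = hw_thread_limit          # scratch-pad is not a constraint
    else:
        by_smem = smem_bytes // smem_per_thread_bytes
    threads = min(by_regs, by_smem, hw_thread_limit)
    return (threads // warp_size) * warp_size


def best_partition(register_demand, min_regs=16, bytes_per_value=4):
    """Search how many values to move from registers to scratch-pad memory.

    Returns (registers kept, values moved, resident threads) for the split
    that maximizes resident threads. This ignores the extra latency of
    scratch-pad accesses, which a real allocator would have to weigh.
    """
    best = None
    for moved in range(0, register_demand - min_regs + 1):
        regs = register_demand - moved
        threads = max_concurrent_threads(regs, bytes_per_value * moved)
        if best is None or threads > best[2]:
            best = (regs, moved, threads)
    return best


if __name__ == "__main__":
    baseline = max_concurrent_threads(64, 0)
    regs, moved, threads = best_partition(64)
    print(f"all 64 values in registers: {baseline} threads")
    print(f"{regs} in registers, {moved} in scratch-pad: {threads} threads")
```

In this toy model a thread that needs 64 values is limited by the register file when everything stays in registers, and moving a few values to scratch-pad memory raises the thread count until scratch-pad capacity becomes the new limit. The model optimizes concurrency only; it says nothing about the single-thread slowdown that the abstract also requires an allocator to keep in check.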
Pages: 293-302
Number of pages: 10
Related papers
50 in total
  • [22] An On-Chip Trainable and Scalable In-Memory ANN Architecture for AI/ML Applications
    Kumar, Abhash
    Beeraka, Sai Manohar
    Singh, Jawar
    Gupta, Bharat
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (05) : 2828 - 2851
  • [23] Fast and Accurate Code Placement of Embedded Software for Hybrid On-chip Memory Architecture
    Zhou, Zimeng
    Ju, Lei
    Jia, Zhiping
    Li, Xin
    2014 IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, 2014 IEEE 6TH INTL SYMP ON CYBERSPACE SAFETY AND SECURITY, 2014 IEEE 11TH INTL CONF ON EMBEDDED SOFTWARE AND SYST (HPCC,CSS,ICESS), 2014, : 1008 - 1015
  • [24] The Organization of On-Chip Data Memory in One Coarse-Grained Reconfigurable Architecture
    Wang, Yansheng
    Liu, Leibo
    Yin, Shouyi
    Zhu, Min
    Cao, Peng
    Yang, Jun
    Wei, Shaojun
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2013, E96A (11) : 2218 - 2229
  • [25] On the Reliability of FeFET On-Chip Memory
    Genssler, Paul R.
    van Santen, Victor M.
    Henkel, Joerg
    Amrouch, Hussam
    IEEE TRANSACTIONS ON COMPUTERS, 2022, 71 (04) : 947 - 958
  • [26] On-chip cache memory resilience
    Hwang, SH
    Choi, GS
    THIRD IEEE INTERNATIONAL HIGH-ASSURANCE SYSTEMS ENGINEERING SYMPOSIUM, PROCEEDINGS, 1998, : 240 - 247
  • [27] A fast on-chip profiler memory
    Lysecky, R
    Cotterell, S
    Vahid, F
    39TH DESIGN AUTOMATION CONFERENCE, PROCEEDINGS 2002, 2002, : 28 - 33
  • [28] A Unified Framework for Error Correction in On-Chip Memories
    Sala, Frederic
    Duwe, Henry
    Dolecek, Lara
    Kumar, Rakesh
    2016 46TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS WORKSHOPS (DSN-W), 2016, : 268 - 274
  • [29] Unified Inductance Calculations for On-Chip Planar Spirals
    Xie, Shuangwen
    Fu, Jun
    2022 29TH IEEE INTERNATIONAL CONFERENCE ON ELECTRONICS, CIRCUITS AND SYSTEMS (IEEE ICECS 2022), 2022,
  • [30] An architecture and compiler for scalable on-chip communication
    Liang, H
    Laffely, A
    Srinivasan, S
    Tessier, R
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2004, 12 (07) : 711 - 726