Unified On-chip Memory Allocation for SIMT Architecture

Cited by: 12
Authors
Hayes, Ari B. [1]
Zhang, Eddy Z. [1]
Affiliation
[1] Rutgers State Univ, Dept Comp Sci, Piscataway, NJ 08554 USA
Keywords
GPU; Register Allocation; Shared Memory Allocation; Compiler Optimization; Concurrency
DOI
10.1145/2597652.2597685
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The popularity of general-purpose graphics processing units (GPUs) is largely attributed to the tremendous concurrency enabled by their underlying single instruction, multiple thread (SIMT) architecture. The architecture keeps the context of a large number of threads in registers to enable fast "context switches" when the processor is stalled by execution dependences, memory requests, and similar events. The SIMT architecture has a large register file that is evenly partitioned among all concurrent threads. Per-thread register usage therefore determines the number of concurrent threads, which strongly affects whole-program performance. Existing register allocation techniques, studied extensively over the past several decades, are oblivious to the register contention caused by the concurrent execution of many threads. They are prone to making optimization decisions that benefit a single thread but degrade the performance of the whole application. Is it possible for compilers to make register allocation decisions that maximize whole-application GPU performance? We tackle this question from two different angles in this paper. First, we propose a unified on-chip memory allocation framework that uses scratch-pad memory to (1) alleviate single-thread register pressure and (2) increase whole-application throughput. Second, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we discovered that it is possible to determine automatically, at compile time, an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure. We achieve up to a 1.70x speedup over the baseline of a traditional register allocation scheme that maximizes single-thread performance.
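The trade-off described in the abstract, where per-thread register usage limits how many threads can be resident and scratch-pad memory can absorb some of that pressure, can be illustrated with a small occupancy model. The following is a minimal sketch and not the authors' framework or characterization model: the hardware limits, the function names `resident_threads` and `best_partition`, and the exhaustive search are all assumptions made for illustration. It also ignores thread-block granularity and allocation rounding that real hardware applies.

```python
# Illustrative occupancy model. All hardware limits below are assumed values
# chosen for the example; they are not figures taken from the paper.
REGISTER_FILE_WORDS = 65536   # assumed 32-bit registers per multiprocessor
SHARED_MEMORY_BYTES = 49152   # assumed scratch-pad bytes per multiprocessor
MAX_RESIDENT_THREADS = 2048   # assumed hardware cap on concurrent threads
WORD_BYTES = 4


def resident_threads(regs_per_thread, smem_words_per_thread):
    """Threads that fit on one multiprocessor under both resource limits."""
    limit = MAX_RESIDENT_THREADS
    # The register file is split evenly among resident threads.
    limit = min(limit, REGISTER_FILE_WORDS // regs_per_thread)
    if smem_words_per_thread > 0:
        smem_bytes = smem_words_per_thread * WORD_BYTES
        limit = min(limit, SHARED_MEMORY_BYTES // smem_bytes)
    return limit


def best_partition(live_values, min_regs=16):
    """Try every split of live values between registers and scratch-pad.

    Returns (regs_per_thread, smem_words_per_thread, threads), preferring
    more concurrent threads and, on ties, more values kept in registers.
    """
    best = None
    for regs in range(min_regs, live_values + 1):
        spilled = live_values - regs  # values moved to scratch-pad memory
        threads = resident_threads(regs, spilled)
        if best is None or (threads, regs) > (best[2], best[0]):
            best = (regs, spilled, threads)
    return best


if __name__ == "__main__":
    print(resident_threads(64, 0))
    print(best_partition(64))
```

Under these assumed limits, a kernel that keeps 64 values per thread entirely in registers fits 1024 resident threads, while moving 10 of those values to scratch-pad memory raises the count to 1213. The model says nothing about the extra latency of scratch-pad accesses, which is the single-thread cost the abstract says must be balanced against concurrency.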
Pages: 293-302
Page count: 10