Unified On-chip Memory Allocation for SIMT Architecture

Cited by: 12
Authors
Hayes, Ari B. [1]
Zhang, Eddy Z. [1]
Affiliations
[1] Rutgers State Univ, Dept Comp Sci, Piscataway, NJ 08554 USA
Keywords
GPU; Register Allocation; Shared Memory Allocation; Compiler Optimization; Concurrency;
DOI
10.1145/2597652.2597685
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The popularity of general-purpose graphics processing units (GPUs) is largely attributed to the tremendous concurrency enabled by their underlying single instruction, multiple thread (SIMT) architecture. It keeps the context of a significant number of threads in registers to enable fast "context switches" when the processor is stalled due to execution dependences, memory requests, and so on. The SIMT architecture has a large register file evenly partitioned among all concurrent threads. Per-thread register usage determines the number of concurrent threads, which strongly affects whole-program performance. Existing register allocation techniques, extensively studied over the past several decades, are oblivious to the register contention caused by the concurrent execution of many threads. They are prone to making optimization decisions that benefit a single thread but degrade whole-application performance. Is it possible for compilers to make register allocation decisions that maximize whole GPU application performance? We tackle this important question from two different aspects in this paper. First, we propose a unified on-chip memory allocation framework that uses scratch-pad memory to help (1) alleviate single-thread register pressure and (2) increase whole-application throughput. Second, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we discovered that it is possible to automatically determine, at compile time, an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure. We achieve a speedup of up to 1.70x over the baseline, a traditional register allocation scheme that maximizes single-thread performance.
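The trade-off described in the abstract, where per-thread register usage limits how many threads can be resident and scratch-pad memory can absorb some of that pressure, can be illustrated with a small calculation. The sketch below is not the authors' characterization model. It is a simplified illustration under assumed hardware limits (the constants `REG_FILE_WORDS`, `SMEM_BYTES`, and `MAX_THREADS` are hypothetical), it ignores allocation granularity, and it ignores the fact that scratch-pad accesses are slower than register accesses. The function names are likewise made up for this example.

```python
# Hypothetical per-multiprocessor limits, chosen only for illustration.
REG_FILE_WORDS = 65536   # assumed number of 32-bit registers
SMEM_BYTES = 49152       # assumed scratch-pad (shared memory) capacity
MAX_THREADS = 2048       # assumed hardware cap on resident threads
WORD_BYTES = 4           # one 32-bit value


def resident_threads(regs_per_thread, smem_words_per_thread, threads_per_block):
    """Resident threads when each thread holds some values in registers
    and some in scratch-pad memory (simplified: whole blocks only)."""
    blocks = MAX_THREADS // threads_per_block
    blocks = min(blocks, REG_FILE_WORDS // (regs_per_thread * threads_per_block))
    if smem_words_per_thread > 0:
        smem_per_block = smem_words_per_thread * WORD_BYTES * threads_per_block
        blocks = min(blocks, SMEM_BYTES // smem_per_block)
    return blocks * threads_per_block


def best_partition(live_values, threads_per_block):
    """Sweep how many of a thread's live values are placed in scratch-pad
    memory and return (regs, smem_words, threads) for the partition with
    the most resident threads; ties keep the fewest scratch-pad words."""
    best = (live_values, 0, resident_threads(live_values, 0, threads_per_block))
    for in_smem in range(1, live_values):
        regs = live_values - in_smem
        threads = resident_threads(regs, in_smem, threads_per_block)
        if threads > best[2]:
            best = (regs, in_smem, threads)
    return best


if __name__ == "__main__":
    print(resident_threads(64, 0, 64))  # registers only
    print(best_partition(64, 64))       # registers plus scratch-pad
```

Under these assumed limits, a kernel needing 64 values per thread is register-bound when everything stays in registers, and moving a few values to scratch-pad memory admits more resident blocks until the scratch-pad capacity becomes the bottleneck. Maximizing thread count alone is only half of the question the abstract raises; single-thread performance also has to be preserved.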
Pages: 293-302
Number of pages: 10