Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

Cited by: 0
Authors
Candel, Francisco [1 ]
Valero, Alejandro [2 ]
Petit, Salvador [1 ]
Sahuquillo, Julio [1 ]
Affiliations
[1] Univ Politecn Valencia, Dept Comp Engn, E-46022 Valencia, Spain
[2] Univ Zaragoza, Inst Univ Ingn Aragon, Dept Informat & Ingn Sistemas, E-50009 Zaragoza, Spain
Keywords
GPU; memory hierarchy; miss management;
DOI
10.1109/TC.2019.2907591
CLC number
TP3 [Computing Technology, Computer Technology];
Discipline code
0812 ;
Abstract
To support the massive amount of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size increases considerably with each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size scales in neither performance nor energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that, in most cases, the way LLC misses are managed is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of the incoming blocks until they are fetched from main memory. Then, the fetched blocks are swapped with the victim blocks (i.e., those selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance for three main reasons: i) the lifetime of blocks being replaced is extended, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and much lower area requirements. Experimental results show that the proposed FRC cache scales in performance with the number of GPU compute units and the LLC size, since, depending on the FRC size, performance improves by 30 to 67 percent for a modern baseline GPU card, and by 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average by 49 to 57 percent for the larger GPU. These benefits come with a small area increase (7.3 percent) over the LLC baseline.
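The miss-handling flow described in the abstract can be illustrated with a toy cache model: on an LLC miss, a small FRC entry is reserved to track the in-flight fetch while the victim block stays resident (and hittable) in the LLC; only when the fetched data arrives are the blocks swapped and the victim evicted. This is a minimal sketch of the general idea, not the paper's implementation; the class and method names (`FRCCache`, `access`, `fill`) are hypothetical.

```python
from collections import OrderedDict

class FRCCache:
    """Toy single-set cache with an FRC-style pending-miss buffer."""

    def __init__(self, llc_ways=4, frc_entries=2):
        self.llc = OrderedDict()        # tag -> block data, in LRU order
        self.llc_ways = llc_ways
        self.frc = {}                   # tag -> control state of in-flight fetch
        self.frc_entries = frc_entries

    def access(self, tag):
        if tag in self.llc:
            self.llc.move_to_end(tag)   # refresh LRU position on a hit
            return "hit"
        if len(self.frc) < self.frc_entries:
            self.frc[tag] = "fetching"  # reserve FRC entry; victim stays in LLC
            return "miss-frc"
        return "miss-stall"             # FRC full: the miss must wait

    def fill(self, tag):
        """Fetched data arrived: install it in the LLC, evicting the victim
        only now (its lifetime was extended for the whole fetch latency)."""
        if tag not in self.frc:
            return None
        del self.frc[tag]
        victim = None
        if len(self.llc) >= self.llc_ways:
            victim, _ = self.llc.popitem(last=False)  # evict LRU victim
        self.llc[tag] = "data"
        return victim
```

Note the key property: between `access` and `fill`, the victim block is still present in the LLC and can still serve hits, which models the extended block lifetime the abstract credits for part of the performance gain.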
Pages: 1442-1454
Page count: 13
Related papers
50 records in total
  • [21] Automatic Sublining for Efficient Sparse Memory Accesses
    Heirman, Wim
    Eyerman, Stijn
    Du Bois, Kristof
    Hur, Ibrahim
    [J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2021, 18 (03)
  • [22] Efficient Hashing with Lookups in two Memory Accesses
    Panigrahy, Rina
    [J]. PROCEEDINGS OF THE SIXTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2005, : 830 - 839
  • [23] Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads
    Albert Segura
    Jose Maria Arnau
    Antonio Gonzalez
    [J]. The Journal of Supercomputing, 2023, 79 : 762 - 787
  • [24] Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads
    Segura, Albert
    Arnau, Jose Maria
    Gonzalez, Antonio
    [J]. JOURNAL OF SUPERCOMPUTING, 2023, 79 (01): : 762 - 787
  • [25] Smart-Cache: Optimising Memory Accesses for Arbitrary Boundaries and Stencils on FPGAs
    Nabi, Syed Waqar
    Vanderbauwhede, Wim
    [J]. 2019 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2019, : 87 - 90
  • [26] A GPGPU Compiler for Memory Optimization and Parallelism Management
    Yang, Yi
    Xiang, Ping
    Kong, Jingfei
    Zhou, Huiyang
    [J]. ACM SIGPLAN NOTICES, 2010, 45 (06) : 86 - 97
  • [27] A GPGPU Compiler for Memory Optimization and Parallelism Management
    Yang, Yi
    Xiang, Ping
    Kong, Jingfei
    Zhou, Huiyang
    [J]. PLDI '10: PROCEEDINGS OF THE 2010 ACM SIGPLAN CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION, 2010, : 86 - 97
  • [28] Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing
    Kim, Kyu Yeun
    Baek, Woongki
    [J]. MICROPROCESSORS AND MICROSYSTEMS, 2016, 43 : 81 - 94
  • [29] Incorporating selective victim cache into GPGPU for high-performance computing
    Wang, Jianfei
    Fan, Fengfeng
    Jiang, Li
    Liang, Xiaoyao
    Jing, Naifeng
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (24):
  • [30] Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses
    Xue, Feng
    Han, Chenji
    Li, Xinyu
    Wu, Junliang
    Zhang, Tingting
    Liu, Tianyi
    Hao, Yifan
    Du, Zidong
    Guo, Qi
    Zhang, Fuxin
    [J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2024, 21 (02)