Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

Cited by: 0
Authors
Candel, Francisco [1 ]
Valero, Alejandro [2 ]
Petit, Salvador [1 ]
Sahuquillo, Julio [1 ]
Affiliations
[1] Univ Politecn Valencia, Dept Comp Engn, E-46022 Valencia, Spain
[2] Univ Zaragoza, Inst Univ Ingn Aragon, Dept Informat & Ingn Sistemas, E-50009 Zaragoza, Spain
Keywords
GPU; memory hierarchy; miss management;
DOI
10.1109/TC.2019.2907591
CLC number
TP3 [Computing Technology, Computer Technology];
Discipline code
0812 ;
Abstract
To support the massive amount of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size increases considerably with each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size scales in neither performance nor energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that, in most cases, the way LLC misses are managed is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of the incoming blocks until they are fetched from main memory. Then, the fetched blocks are swapped with the victim blocks (i.e., those selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance for three main reasons: i) the lifetime of blocks being replaced is extended, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and much lower area requirements. Experimental results show that the proposed FRC cache scales in performance with the number of GPU compute units and the LLC size, since, depending on the FRC size, performance improves by 30 to 67 percent for a modern baseline GPU card, and by 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average by 49 to 57 percent for the larger GPU. These benefits come with a small area increase (7.3 percent) over the LLC baseline.
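The miss-handling flow described in the abstract can be illustrated with a toy cache model: on an LLC miss, a small FRC entry is reserved to track the in-flight fetch while the victim block stays resident (and hittable) in the LLC; only when the fetched data arrives are the blocks swapped and the victim evicted. This is a minimal sketch of the general idea, not the paper's implementation; the class and method names (`FRCCache`, `access`, `fill`) are hypothetical.

```python
from collections import OrderedDict

class FRCCache:
    """Toy single-set cache with an FRC-style pending-miss buffer."""

    def __init__(self, llc_ways=4, frc_entries=2):
        self.llc = OrderedDict()        # tag -> block data, in LRU order
        self.llc_ways = llc_ways
        self.frc = {}                   # tag -> control state of in-flight fetch
        self.frc_entries = frc_entries

    def access(self, tag):
        if tag in self.llc:
            self.llc.move_to_end(tag)   # refresh LRU position on a hit
            return "hit"
        if len(self.frc) < self.frc_entries:
            self.frc[tag] = "fetching"  # reserve FRC entry; victim stays in LLC
            return "miss-frc"
        return "miss-stall"             # FRC full: the miss must wait

    def fill(self, tag):
        """Fetched data arrived: install it in the LLC, evicting the victim
        only now (its lifetime was extended for the whole fetch latency)."""
        if tag not in self.frc:
            return None
        del self.frc[tag]
        victim = None
        if len(self.llc) >= self.llc_ways:
            victim, _ = self.llc.popitem(last=False)  # evict LRU victim
        self.llc[tag] = "data"
        return victim
```

Note the key property: between `access` and `fill`, the victim block is still present in the LLC and can still serve hits, which models the extended block lifetime the abstract credits for part of the performance gain.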
Pages: 1442-1454
Page count: 13
Related papers
50 records in total
  • [21] Automatic Sublining for Efficient Sparse Memory Accesses
    Heirman, Wim
    Eyerman, Stijn
    Du Bois, Kristof
    Hur, Ibrahim
    [J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2021, 18 (03)
  • [22] Efficient Hashing with Lookups in two Memory Accesses
    Panigrahy, Rina
    [J]. PROCEEDINGS OF THE SIXTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2005, : 830 - 839
  • [23] Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads
    Albert Segura
    Jose Maria Arnau
    Antonio Gonzalez
    [J]. The Journal of Supercomputing, 2023, 79 : 762 - 787
  • [24] Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads
    Segura, Albert
    Arnau, Jose Maria
    Gonzalez, Antonio
    [J]. JOURNAL OF SUPERCOMPUTING, 2023, 79 (01): : 762 - 787
  • [25] Smart-Cache: Optimising Memory Accesses for Arbitrary Boundaries and Stencils on FPGAs
    Nabi, Syed Waqar
    Vanderbauwhede, Wim
    [J]. 2019 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2019, : 87 - 90
  • [26] A GPGPU Compiler for Memory Optimization and Parallelism Management
    Yang, Yi
    Xiang, Ping
    Kong, Jingfei
    Zhou, Huiyang
    [J]. ACM SIGPLAN NOTICES, 2010, 45 (06) : 86 - 97
  • [27] A GPGPU Compiler for Memory Optimization and Parallelism Management
    Yang, Yi
    Xiang, Ping
    Kong, Jingfei
    Zhou, Huiyang
    [J]. PLDI '10: PROCEEDINGS OF THE 2010 ACM SIGPLAN CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION, 2010, : 86 - 97
  • [28] Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing
    Kim, Kyu Yeun
    Baek, Woongki
    [J]. MICROPROCESSORS AND MICROSYSTEMS, 2016, 43 : 81 - 94
  • [29] Incorporating selective victim cache into GPGPU for high-performance computing
    Wang, Jianfei
    Fan, Fengfeng
    Jiang, Li
    Liang, Xiaoyao
    Jing, Naifeng
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (24):
  • [30] Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses
    Xue, Feng
    Han, Chenji
    Li, Xinyu
    Wu, Junliang
    Zhang, Tingting
    Liu, Tianyi
    Hao, Yifan
    Du, Zidong
    Guo, Qi
    Zhang, Fuxin
    [J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2024, 21 (02)