Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

Cited: 0
Authors
Candel, Francisco [1 ]
Valero, Alejandro [2 ]
Petit, Salvador [1 ]
Sahuquillo, Julio [1 ]
Affiliations
[1] Univ Politecn Valencia, Dept Comp Engn, E-46022 Valencia, Spain
[2] Univ Zaragoza, Inst Univ Ingn Aragon, Dept Informat & Ingn Sistemas, E-50009 Zaragoza, Spain
Keywords
GPU; memory hierarchy; miss management;
DOI
10.1109/TC.2019.2907591
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
To support the massive number of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming increasingly complex, and the Last Level Cache (LLC) size grows considerably with each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size scales in neither performance nor energy consumption. We examine how LLC misses are managed in typical GPUs and find that, in most cases, the way LLC misses are managed is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of incoming blocks until they are fetched from main memory. The fetched blocks are then swapped with the victim blocks (i.e., those selected to be replaced) in the LLC, and the eviction of the victim blocks is performed from the FRC. This approach improves performance for three main reasons: i) the lifetime of blocks being replaced is extended, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency, compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and much smaller area requirements. Experimental results show that the proposed FRC scales in performance with the number of GPU compute units and the LLC size: depending on the FRC size, performance improves from 30 to 67 percent for a modern baseline GPU card, and from 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average by 49 to 57 percent for the larger GPU. These benefits come with a small area increase (7.3 percent) over the LLC baseline.
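To make the swap mechanism concrete, the sketch below models the FRC miss flow the abstract describes: a miss allocates only a small FRC entry, the LLC victim is selected at fill time and swapped into the FRC, and its eviction proceeds from there off the critical path. This is a minimal single-access illustration under assumed semantics, not the authors' implementation; every type and name in it (Block, FrcEntry, LLC, writeback) is a hypothetical stand-in.

```cpp
// Minimal sketch of the Fetch and Replacement Cache (FRC) miss flow
// described in the abstract. All names here (Block, FrcEntry, LLC,
// writeback) are hypothetical illustrations, not the authors' design.
#include <cstdint>
#include <deque>
#include <iostream>
#include <optional>
#include <unordered_map>

struct Block {
    uint64_t tag   = 0;
    bool     dirty = false;
};

// One FRC entry tracks an in-flight miss: control and coherence state of
// the incoming block lives here instead of in an early-allocated LLC line.
struct FrcEntry {
    uint64_t incoming_tag;        // block being fetched from main memory
    std::optional<Block> victim;  // LLC victim parked here after the swap
};

struct LLC {
    std::unordered_map<uint64_t, Block> lines;  // tag -> block (toy model)

    bool hit(uint64_t tag) const { return lines.count(tag) != 0; }

    // The victim is chosen only when the fetched data arrives, so it keeps
    // servicing hits during the whole main-memory latency.
    Block evict_any() {
        Block victim = lines.begin()->second;
        lines.erase(lines.begin());
        return victim;
    }
};

void writeback(const Block& b) {  // hypothetical DRAM-side write
    std::cout << "writing back dirty victim " << b.tag << "\n";
}

int main() {
    LLC llc;
    std::deque<FrcEntry> frc;  // the tiny additional FRC structure

    llc.lines[1] = {1, true};
    llc.lines[2] = {2, false};

    uint64_t tag = 7;
    if (!llc.hit(tag)) {
        // 1) Miss: allocate an FRC entry; no LLC line is reserved yet, so
        //    the eventual victim's lifetime in the LLC is extended.
        frc.push_back({tag, std::nullopt});

        // 2) Fill arrives from main memory: swap fetched block and victim.
        FrcEntry& e = frc.back();
        e.victim = llc.evict_any();
        llc.lines[e.incoming_tag] = {e.incoming_tag, false};

        // 3) Evict the victim from the FRC, off the critical path, so long
        //    bursts of misses do not clog the main memory path.
        if (e.victim->dirty) writeback(*e.victim);
        frc.pop_back();
    }
    std::cout << "block " << tag << " installed; FRC empty again\n";
}
```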
Pages: 1442-1454
Page count: 13
Related Papers (50 in total)
  • [1] Orchestrating Cache Management and Memory Scheduling for GPGPU Applications
    Mu, Shuai
    Deng, Yandong
    Chen, Yubei
    Li, Huaiming
    Pan, Jianming
    Zhang, Wenjun
    Wang, Zhihua
    [J]. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2014, 22 (08) : 1803 - 1814
  • [2] Energy-Efficient Stream Compaction Through Filtering and Coalescing Accesses in GPGPU Memory Partitions
    Segura, Albert
    Arnau, Jose-Maria
    Gonzalez, Antonio
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2022, 71 (07) : 1711 - 1723
  • [3] IACM: Integrated Adaptive Cache Management for High-Performance and Energy-Efficient GPGPU Computing
    Kim, Kyu Yeun
    Park, Jinsu
    Baek, Woongki
    [J]. PROCEEDINGS OF THE 34TH IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2016, : 380 - 383
  • [4] DyCache: Dynamic Multi-Grain Cache Management for Irregular Memory Accesses on GPU
    Guo, Hui
    Huang, Libo
    Lu, Yashuai
    Ma, Sheng
    Wang, Zhiying
    [J]. IEEE ACCESS, 2018, 6 : 38881 - 38891
  • [5] BOOST PROCESSOR PERFORMANCE WITH 2-LEVEL CACHE MEMORY
    DEVANE, CJ
    LIDINGTON, G
    [J]. ELECTRONIC DESIGN, 1988, 36 (13) : 97 - &
  • [6] Cache performance analysis of traversals and random accesses
    Ladner, RE
    Fix, JD
    LaMarca, A
    [J]. PROCEEDINGS OF THE TENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 1999, : 613 - 622
  • [7] CART: Cache Access Reordering Tree for Efficient Cache and Memory Accesses in GPUs
    Gu, Yongbin
    Chen, Lizhong
    [J]. 2018 IEEE 36TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2018, : 250 - 257
  • [8] Efficient cache invalidation schemes for mobile data accesses
    Chuang, Po-Jen
    Chiu, Yu-Shian
    [J]. INFORMATION SCIENCES, 2011, 181 (22) : 5084 - 5101
  • [9] BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads
    Liu, Yuxi
    Zhao, Xia
    Yu, Zhibin
    Wang, Zhenlin
    Wang, Xiaolin
    Luo, Yingwei
    Eeckhout, Lieven
    [J]. 2017 IEEE 35TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2017, : 633 - 640
  • [10] Improving the Performance and Energy Efficiency of GPGPU Computing through Integrated Adaptive Cache Management
    Kim, Kyu Yeun
    Park, Jinsu
    Baek, Woongki
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (03) : 630 - 645