Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

Cited: 0
Authors
Candel, Francisco [1 ]
Valero, Alejandro [2 ]
Petit, Salvador [1 ]
Sahuquillo, Julio [1 ]
Affiliations
[1] Univ Politecn Valencia, Dept Comp Engn, E-46022 Valencia, Spain
[2] Univ Zaragoza, Inst Univ Ingn Aragon, Dept Informat & Ingn Sistemas, E-50009 Zaragoza, Spain
Keywords
GPU; memory hierarchy; miss management;
DOI
10.1109/TC.2019.2907591
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
To support the massive number of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming increasingly complex, and the Last Level Cache (LLC) size grows considerably with each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size scales in neither performance nor energy consumption. We examine how LLC misses are managed in typical GPUs and find that, in most cases, the way LLC misses are managed is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of incoming blocks until they are fetched from main memory. The fetched blocks are then swapped with the victim blocks (i.e., those selected to be replaced) in the LLC, and the eviction of the victim blocks is performed from the FRC. This approach improves performance for three main reasons: i) the lifetime of blocks being replaced is extended, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency, compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and much smaller area requirements. Experimental results show that the proposed FRC scales in performance with the number of GPU compute units and the LLC size: depending on the FRC size, performance improves from 30 to 67 percent for a modern baseline GPU card, and from 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average by 49 to 57 percent for the larger GPU. These benefits come with a small area increase (7.3 percent) over the LLC baseline.
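To make the swap mechanism concrete, the sketch below models the FRC miss flow the abstract describes: a miss allocates only a small FRC entry, the LLC victim is selected at fill time and swapped into the FRC, and its eviction proceeds from there off the critical path. This is a minimal single-access illustration under assumed semantics, not the authors' implementation; every type and name in it (Block, FrcEntry, LLC, writeback) is a hypothetical stand-in.

```cpp
// Minimal sketch of the Fetch and Replacement Cache (FRC) miss flow
// described in the abstract. All names here (Block, FrcEntry, LLC,
// writeback) are hypothetical illustrations, not the authors' design.
#include <cstdint>
#include <deque>
#include <iostream>
#include <optional>
#include <unordered_map>

struct Block {
    uint64_t tag   = 0;
    bool     dirty = false;
};

// One FRC entry tracks an in-flight miss: control and coherence state of
// the incoming block lives here instead of in an early-allocated LLC line.
struct FrcEntry {
    uint64_t incoming_tag;        // block being fetched from main memory
    std::optional<Block> victim;  // LLC victim parked here after the swap
};

struct LLC {
    std::unordered_map<uint64_t, Block> lines;  // tag -> block (toy model)

    bool hit(uint64_t tag) const { return lines.count(tag) != 0; }

    // The victim is chosen only when the fetched data arrives, so it keeps
    // servicing hits during the whole main-memory latency.
    Block evict_any() {
        Block victim = lines.begin()->second;
        lines.erase(lines.begin());
        return victim;
    }
};

void writeback(const Block& b) {  // hypothetical DRAM-side write
    std::cout << "writing back dirty victim " << b.tag << "\n";
}

int main() {
    LLC llc;
    std::deque<FrcEntry> frc;  // the tiny additional FRC structure

    llc.lines[1] = {1, true};
    llc.lines[2] = {2, false};

    uint64_t tag = 7;
    if (!llc.hit(tag)) {
        // 1) Miss: allocate an FRC entry; no LLC line is reserved yet, so
        //    the eventual victim's lifetime in the LLC is extended.
        frc.push_back({tag, std::nullopt});

        // 2) Fill arrives from main memory: swap fetched block and victim.
        FrcEntry& e = frc.back();
        e.victim = llc.evict_any();
        llc.lines[e.incoming_tag] = {e.incoming_tag, false};

        // 3) Evict the victim from the FRC, off the critical path, so long
        //    bursts of misses do not clog the main memory path.
        if (e.victim->dirty) writeback(*e.victim);
        frc.pop_back();
    }
    std::cout << "block " << tag << " installed; FRC empty again\n";
}
```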
Pages: 1442-1454
Page count: 13
Related Papers (50 in total)
  • [1] Orchestrating Cache Management and Memory Scheduling for GPGPU Applications
    Mu, Shuai
    Deng, Yandong
    Chen, Yubei
    Li, Huaiming
    Pan, Jianming
    Zhang, Wenjun
    Wang, Zhihua
    [J]. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2014, 22 (08) : 1803 - 1814
  • [2] Energy-Efficient Stream Compaction Through Filtering and Coalescing Accesses in GPGPU Memory Partitions
    Segura, Albert
    Arnau, Jose-Maria
    Gonzalez, Antonio
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2022, 71 (07) : 1711 - 1723
  • [3] IACM: Integrated Adaptive Cache Management for High-Performance and Energy-Efficient GPGPU Computing
    Kim, Kyu Yeun
    Park, Jinsu
    Baek, Woongki
    [J]. PROCEEDINGS OF THE 34TH IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2016, : 380 - 383
  • [4] DyCache: Dynamic Multi-Grain Cache Management for Irregular Memory Accesses on GPU
    Guo, Hui
    Huang, Libo
    Lu, Yashuai
    Ma, Sheng
    Wang, Zhiying
    [J]. IEEE ACCESS, 2018, 6 : 38881 - 38891
  • [5] BOOST PROCESSOR PERFORMANCE WITH 2-LEVEL CACHE MEMORY
    DEVANE, CJ
    LIDINGTON, G
    [J]. ELECTRONIC DESIGN, 1988, 36 (13) : 97 - &
  • [6] Cache performance analysis of traversals and random accesses
    Ladner, RE
    Fix, JD
    LaMarca, A
    [J]. PROCEEDINGS OF THE TENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 1999, : 613 - 622
  • [7] CART: Cache Access Reordering Tree for Efficient Cache and Memory Accesses in GPUs
    Gu, Yongbin
    Chen, Lizhong
    [J]. 2018 IEEE 36TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2018, : 250 - 257
  • [8] Efficient cache invalidation schemes for mobile data accesses
    Chuang, Po-Jen
    Chiu, Yu-Shian
    [J]. INFORMATION SCIENCES, 2011, 181 (22) : 5084 - 5101
  • [9] BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads
    Liu, Yuxi
    Zhao, Xia
    Yu, Zhibin
    Wang, Zhenlin
    Wang, Xiaolin
    Luo, Yingwei
    Eeckhout, Lieven
    [J]. 2017 IEEE 35TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2017, : 633 - 640
  • [10] Improving the Performance and Energy Efficiency of GPGPU Computing through Integrated Adaptive Cache Management
    Kim, Kyu Yeun
    Park, Jinsu
    Baek, Woongki
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (03) : 630 - 645