Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads

Cited by: 9
Authors
Choo, Kyoshin [1 ]
Panlener, William [1 ]
Jang, Byunghyun [1 ]
Affiliations
[1] University of Mississippi, Computer and Information Science, University, MS 38677, USA
Keywords
DOI
10.1109/ISPDC.2014.29
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Processing elements such as CPUs and GPUs rely on caches to bridge the classic processor-memory performance gap. As GPUs evolve into general-purpose co-processors that share the load with CPUs, good cache design and use become increasingly important. While both CPUs and GPUs must cooperate and perform well, their memory access patterns are very different: on CPUs, only a few threads access memory simultaneously, whereas on GPUs thousands of threads contend for memory at once. Despite this difference, little research has investigated the behavior and performance of GPU caches in depth. In this paper, we present an extensive study of the characterization and improvement of GPU cache behavior and performance for general-purpose workloads, using a cycle-accurate, ISA-level GPU architectural simulator that models one of the latest GPU architectures, AMD's Graphics Core Next (GCN). Our study makes the following observations and improvements. First, the L1 vector data cache hit rate is substantially lower than that of CPU caches; the main culprit is compulsory misses caused by the lack of data reuse among massively simultaneous threads. Second, there is significant memory access contention in the shared L2 data cache, accounting for up to 19% of total accesses for some benchmarks; this contention remains a major performance barrier even though the L2 hit rate is high. Third, we demonstrate that memory access coalescing plays a critical role in reducing memory traffic. Finally, we find that inter-workgroup locality exists and can affect cache behavior and performance. Our experimental results show that memory performance can be improved by 1) a shared L1 vector data cache, in which multiple compute units share a single cache to exploit inter-workgroup locality and increase data reuse, and 2) clustered workgroup scheduling, in which workgroups with consecutive IDs are assigned to the same compute unit.
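
As a rough illustration of the coalescing effect described in the abstract (this sketch is not from the paper, which models AMD GCN; CUDA is used here only because the concept is identical), the two hypothetical kernels below perform the same copy. In the first, consecutive threads read consecutive words, so the hardware merges each warp's loads into a few wide transactions; in the second, a large stride scatters the loads across many cache lines, multiplying memory traffic for the same useful work:

// Hypothetical CUDA sketch contrasting coalesced and strided access.
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: thread i reads word i, so adjacent threads touch adjacent
// words and a warp's 32 loads merge into a few wide transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: a large stride puts each thread's load on a different
// cache line, so the same copy generates many more transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));   // contents are irrelevant here:
    cudaMalloc(&out, n * sizeof(float));  // only the traffic pattern matters
    copy_coalesced<<<n / 256, 256>>>(in, out, n);
    copy_strided<<<n / 256, 256>>>(in, out, n, 257);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Profiling the two kernels with a GPU profiler typically shows a several-fold difference in memory transactions for identical useful work, which is the traffic reduction the abstract attributes to coalescing.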
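The inter-workgroup locality and clustered workgroup scheduling mentioned in the abstract can likewise be sketched with a hypothetical stencil kernel (again CUDA, for illustration only). With block size B, workgroup b reads elements b*B-1 through (b+1)*B, so workgroups b and b+1 share the cache lines at their common boundary; assigning consecutive IDs to the same compute unit lets the second workgroup hit in L1 lines the first one already loaded:

// Hypothetical 3-point stencil: adjacent workgroups overlap at their
// boundaries, creating the inter-workgroup locality the paper exploits.
#include <cuda_runtime.h>

__global__ void stencil3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        // in[i - 1] at this block's first thread and in[i + 1] at its
        // last thread fall on cache lines also read by neighbor blocks.
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    stencil3<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Note that which compute unit runs which workgroup is decided by the hardware scheduler, not by the kernel; the paper's clustered scheduling changes that assignment so boundary reuse of this kind lands in a warm L1.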
Pages: 189-196
Number of pages: 8
Related Papers
50 records in total
  • [21] Jiang, Tao; Zhang, Qianlong; Hou, Rui; Chai, Lin; McKee, Sally A.; Jia, Zhen; Sun, Ninghui. Understanding the Behavior of In-Memory Computing Workloads. 2014 IEEE International Symposium on Workload Characterization (IISWC), 2014: 22-30.
  • [22] Abdelfattah, Ahmad; Haidar, Azzam; Tomov, Stanimire; Dongarra, Jack. Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization. 2018 IEEE High Performance Extreme Computing Conference (HPEC), 2018.
  • [23] Son, Dong Oh; Kim, Gwang Bok; Kim, Jong Myon; Kim, Cheol Hong. Cache Reuse Aware Replacement Policy for Improving GPU Cache Performance. IT Convergence and Security 2017, Vol. 2, 2018, 450: 127-133.
  • [24] Zhao, Xuemei; Sammut, Karl; He, Fangpo. Performance Evaluation of a Novel CMP Cache Structure for Hybrid Workloads. Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies, Proceedings, 2007: 89-96.
  • [25] Maynard, A. M. G.; Donnelly, C. M.; Olszewski, B. R. Contrasting Characteristics and Cache Performance of Technical and Multiuser Commercial Workloads. SIGPLAN Notices, 1994, 29(11): 145-156.
  • [26] Lal, Sohan; Varma, Bogaraju Sharatchandra; Juurlink, Ben. A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads. International Journal of Parallel Programming, 2022, 50(2): 189-216.
  • [28] Lu, Gangzhao; Zhang, Weizhe; Wang, Zheng. Optimizing GPU Memory Transactions for Convolution Operations. 2020 IEEE International Conference on Cluster Computing (CLUSTER 2020), 2020: 399-403.
  • [29] Boukhobza, Jalil; Olivier, Pierre. C-Lash: A Cache System for Optimizing NAND Flash Memory Performance and Lifetime. Digital Information and Communication Technology and Its Applications, Pt II, 2011, 167(2): 599+.
  • [30] Arif, Moiz; Maurya, Avinash; Rafique, M. Mustafa. Accelerating Performance of GPU-based Workloads Using CXL. Proceedings of the 13th Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures (FlexScience 2023), 2023: 27-31.