Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads

Cited by: 9
Authors: Choo, Kyoshin [1]; Panlener, William [1]; Jang, Byunghyun [1]
Affiliation: [1] Univ Mississippi, Comp & Informat Sci, University, MS 38677 USA
DOI: 10.1109/ISPDC.2014.29
CLC number: TP3 [Computing technology, computer technology]
Discipline code: 0812
Abstract
Processing elements such as CPUs and GPUs depend on cache technology to bridge the classic processor-memory performance gap. As GPUs evolve into general-purpose co-processors that share workloads with CPUs, good cache design and use become increasingly important. While both CPUs and GPUs must cooperate and perform well, their memory access patterns are very different. On CPUs, only a few threads access memory simultaneously; on GPUs, memory access contention among thousands of threads is significantly higher. Despite this difference in behavior, little research investigates the behavior and performance of GPU caches in depth. In this paper, we present an extensive study on the characterization and improvement of GPU cache behavior and performance for general-purpose workloads, using a cycle-accurate, ISA-level GPU architectural simulator that models one of the latest GPU architectures, Graphics Core Next (GCN) from AMD. Our study makes the following observations and improvements. First, we observe that the L1 vector data cache hit rate is substantially lower than that of CPU caches. The main culprit is compulsory misses caused by a lack of data reuse among massively simultaneous threads. Second, there is significant memory access contention in the shared L2 data cache, accounting for up to 19% of total accesses for some benchmarks. This high contention remains a main performance barrier in the L2 data cache even though its hit rate is high. Third, we demonstrate that memory access coalescing plays a critical role in reducing memory traffic. Finally, we find that inter-workgroup locality exists and can affect cache behavior and performance.
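The coalescing observation above can be illustrated with a minimal sketch: a wavefront's per-thread byte addresses are merged into the set of distinct cache-line transactions the memory system must actually issue. The function name, the 64-thread wavefront, and the 64-byte line size are illustrative assumptions, not details taken from the paper's simulator.

```python
def coalesce(addresses, line_size=64):
    """Merge per-thread byte addresses from one wavefront into the set of
    distinct cache-line transactions the memory system must issue."""
    return sorted({addr // line_size for addr in addresses})

# 64 threads reading consecutive 4-byte words touch only 4 cache lines,
# so coalescing collapses 64 requests into 4 transactions.
unit_stride = [4 * t for t in range(64)]
print(len(coalesce(unit_stride)))   # 4

# The same 64 threads with a 64-byte stride touch 64 distinct lines:
# no coalescing is possible, and memory traffic grows 16x.
strided = [64 * t for t in range(64)]
print(len(coalesce(strided)))       # 64
```

The contrast is the crux of the paper's third observation: unit-stride access patterns let the hardware serve an entire wavefront with a handful of line fetches, while strided or scattered patterns defeat coalescing and flood the caches with traffic.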
Our experimental results show that memory performance can be improved by 1) a shared L1 vector data cache, where multiple compute units share a single cache to exploit inter-workgroup locality and increase data reusability, and 2) clustered workgroup scheduling, where workgroups with consecutive IDs are assigned to the same compute unit.
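The scheduling idea in the abstract can be sketched as two workgroup-to-compute-unit mappings: a round-robin dispatch that spreads consecutive IDs across compute units, versus clustered scheduling that packs them onto the same one. This is a hypothetical illustration of the concept; the function names and parameters are not taken from AMD GCN or the paper's simulator.

```python
def round_robin(num_workgroups, num_cus):
    """Baseline dispatch: workgroup i goes to compute unit i mod num_cus,
    scattering consecutive IDs across different L1 caches."""
    return {wg: wg % num_cus for wg in range(num_workgroups)}

def clustered(num_workgroups, num_cus):
    """Clustered scheduling: consecutive workgroup IDs share a compute
    unit, so inter-workgroup locality (e.g., overlapping cache lines at
    tile boundaries) is served by one L1 cache instead of several."""
    cluster = (num_workgroups + num_cus - 1) // num_cus  # ceil division
    return {wg: wg // cluster for wg in range(num_workgroups)}

# With 8 workgroups on 4 compute units, workgroups 0 and 1 land on
# different CUs under round-robin but on the same CU when clustered.
print(round_robin(8, 4))  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, ...}
print(clustered(8, 4))    # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, ...}
```

Under the clustered mapping, any data shared between neighboring workgroups hits in the same L1 vector data cache, which is exactly the reuse the paper's shared-L1 proposal targets from the hardware side.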
Pages: 189-196 (8 pages)