Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads

Cited by: 9
Authors: Choo, Kyoshin [1]; Panlener, William [1]; Jang, Byunghyun [1]
Affiliation: [1] Univ Mississippi, Comp & Informat Sci, University, MS 38677 USA
DOI: 10.1109/ISPDC.2014.29
CLC number: TP3 [Computing technology, computer technology]
Discipline code: 0812
Abstract
Processing elements such as CPUs and GPUs depend on cache technology to bridge the classic processor-memory performance gap. As GPUs evolve into general-purpose co-processors that share workloads with CPUs, good cache design and use become increasingly important. While both CPUs and GPUs must cooperate and perform well, their memory access patterns are very different. On CPUs, only a few threads access memory simultaneously; on GPUs, memory access contention among thousands of threads is significantly higher. Despite this difference in behavior, little research investigates the behavior and performance of GPU caches in depth. In this paper, we present an extensive study on the characterization and improvement of GPU cache behavior and performance for general-purpose workloads, using a cycle-accurate, ISA-level GPU architectural simulator that models one of the latest GPU architectures, Graphics Core Next (GCN) from AMD. Our study makes the following observations and improvements. First, we observe that the L1 vector data cache hit rate is substantially lower than that of CPU caches. The main culprit is compulsory misses caused by a lack of data reuse among massively simultaneous threads. Second, there is significant memory access contention in the shared L2 data cache, accounting for up to 19% of total accesses for some benchmarks. This high contention remains a main performance barrier in the L2 data cache even though its hit rate is high. Third, we demonstrate that memory access coalescing plays a critical role in reducing memory traffic. Finally, we find that inter-workgroup locality exists and can affect cache behavior and performance.
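The coalescing observation above can be illustrated with a minimal sketch: a wavefront's per-thread byte addresses are merged into the set of distinct cache-line transactions the memory system must actually issue. The function name, the 64-thread wavefront, and the 64-byte line size are illustrative assumptions, not details taken from the paper's simulator.

```python
def coalesce(addresses, line_size=64):
    """Merge per-thread byte addresses from one wavefront into the set of
    distinct cache-line transactions the memory system must issue."""
    return sorted({addr // line_size for addr in addresses})

# 64 threads reading consecutive 4-byte words touch only 4 cache lines,
# so coalescing collapses 64 requests into 4 transactions.
unit_stride = [4 * t for t in range(64)]
print(len(coalesce(unit_stride)))   # 4

# The same 64 threads with a 64-byte stride touch 64 distinct lines:
# no coalescing is possible, and memory traffic grows 16x.
strided = [64 * t for t in range(64)]
print(len(coalesce(strided)))       # 64
```

The contrast is the crux of the paper's third observation: unit-stride access patterns let the hardware serve an entire wavefront with a handful of line fetches, while strided or scattered patterns defeat coalescing and flood the caches with traffic.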
Our experimental results show that memory performance can be improved by 1) a shared L1 vector data cache, where multiple compute units share a single cache to exploit inter-workgroup locality and increase data reusability, and 2) clustered workgroup scheduling, where workgroups with consecutive IDs are assigned to the same compute unit.
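The scheduling idea in the abstract can be sketched as two workgroup-to-compute-unit mappings: a round-robin dispatch that spreads consecutive IDs across compute units, versus clustered scheduling that packs them onto the same one. This is a hypothetical illustration of the concept; the function names and parameters are not taken from AMD GCN or the paper's simulator.

```python
def round_robin(num_workgroups, num_cus):
    """Baseline dispatch: workgroup i goes to compute unit i mod num_cus,
    scattering consecutive IDs across different L1 caches."""
    return {wg: wg % num_cus for wg in range(num_workgroups)}

def clustered(num_workgroups, num_cus):
    """Clustered scheduling: consecutive workgroup IDs share a compute
    unit, so inter-workgroup locality (e.g., overlapping cache lines at
    tile boundaries) is served by one L1 cache instead of several."""
    cluster = (num_workgroups + num_cus - 1) // num_cus  # ceil division
    return {wg: wg // cluster for wg in range(num_workgroups)}

# With 8 workgroups on 4 compute units, workgroups 0 and 1 land on
# different CUs under round-robin but on the same CU when clustered.
print(round_robin(8, 4))  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, ...}
print(clustered(8, 4))    # {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, ...}
```

Under the clustered mapping, any data shared between neighboring workgroups hits in the same L1 vector data cache, which is exactly the reuse the paper's shared-L1 proposal targets from the hardware side.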
Pages: 189-196 (8 pages)