Understanding and Optimizing GPU Cache Memory Performance for Compute Workloads

Cited by: 9
Authors
Choo, Kyoshin [1 ]
Panlener, William [1 ]
Jang, Byunghyun [1 ]
Affiliations
[1] University of Mississippi, Computer and Information Science, University, MS 38677, USA
Keywords
DOI
10.1109/ISPDC.2014.29
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Processing elements such as CPUs and GPUs rely on caches to bridge the classic processor-memory performance gap. As GPUs evolve into general-purpose co-processors that share the load with CPUs, good cache design and use become increasingly important. While both CPUs and GPUs must cooperate and perform well, their memory access patterns are very different: on CPUs, only a few threads access memory simultaneously, whereas on GPUs thousands of threads contend for memory at once. Despite this difference, little research has investigated the behavior and performance of GPU caches in depth. In this paper, we present an extensive study of the characterization and improvement of GPU cache behavior and performance for general-purpose workloads, using a cycle-accurate, ISA-level GPU architectural simulator that models one of the latest GPU architectures, AMD's Graphics Core Next (GCN). Our study makes the following observations and improvements. First, the L1 vector data cache hit rate is substantially lower than that of CPU caches; the main culprit is compulsory misses caused by the lack of data reuse among massively simultaneous threads. Second, there is significant memory access contention in the shared L2 data cache, accounting for up to 19% of total accesses for some benchmarks; this contention remains a major performance barrier even though the L2 hit rate is high. Third, we demonstrate that memory access coalescing plays a critical role in reducing memory traffic. Finally, we find that inter-workgroup locality exists and can affect cache behavior and performance. Our experimental results show that memory performance can be improved by 1) a shared L1 vector data cache, in which multiple compute units share a single cache to exploit inter-workgroup locality and increase data reuse, and 2) clustered workgroup scheduling, in which workgroups with consecutive IDs are assigned to the same compute unit.
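
As a rough illustration of the coalescing effect described in the abstract (this sketch is not from the paper, which models AMD GCN; CUDA is used here only because the concept is identical), the two hypothetical kernels below perform the same copy. In the first, consecutive threads read consecutive words, so the hardware merges each warp's loads into a few wide transactions; in the second, a large stride scatters the loads across many cache lines, multiplying memory traffic for the same useful work:

// Hypothetical CUDA sketch contrasting coalesced and strided access.
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: thread i reads word i, so adjacent threads touch adjacent
// words and a warp's 32 loads merge into a few wide transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: a large stride puts each thread's load on a different
// cache line, so the same copy generates many more transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));   // contents are irrelevant here:
    cudaMalloc(&out, n * sizeof(float));  // only the traffic pattern matters
    copy_coalesced<<<n / 256, 256>>>(in, out, n);
    copy_strided<<<n / 256, 256>>>(in, out, n, 257);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Profiling the two kernels with a GPU profiler typically shows a several-fold difference in memory transactions for identical useful work, which is the traffic reduction the abstract attributes to coalescing.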
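The inter-workgroup locality and clustered workgroup scheduling mentioned in the abstract can likewise be sketched with a hypothetical stencil kernel (again CUDA, for illustration only). With block size B, workgroup b reads elements b*B-1 through (b+1)*B, so workgroups b and b+1 share the cache lines at their common boundary; assigning consecutive IDs to the same compute unit lets the second workgroup hit in L1 lines the first one already loaded:

// Hypothetical 3-point stencil: adjacent workgroups overlap at their
// boundaries, creating the inter-workgroup locality the paper exploits.
#include <cuda_runtime.h>

__global__ void stencil3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        // in[i - 1] at this block's first thread and in[i + 1] at its
        // last thread fall on cache lines also read by neighbor blocks.
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    stencil3<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Note that which compute unit runs which workgroup is decided by the hardware scheduler, not by the kernel; the paper's clustered scheduling changes that assignment so boundary reuse of this kind lands in a warm L1.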
Pages: 189-196
Number of pages: 8
Related Papers
50 records in total
  • [21] Jiang, Tao; Zhang, Qianlong; Hou, Rui; Chai, Lin; McKee, Sally A.; Jia, Zhen; Sun, Ninghui. Understanding the Behavior of In-Memory Computing Workloads. 2014 IEEE International Symposium on Workload Characterization (IISWC), 2014: 22-30.
  • [22] Abdelfattah, Ahmad; Haidar, Azzam; Tomov, Stanimire; Dongarra, Jack. Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization. 2018 IEEE High Performance Extreme Computing Conference (HPEC), 2018.
  • [23] Son, Dong Oh; Kim, Gwang Bok; Kim, Jong Myon; Kim, Cheol Hong. Cache Reuse Aware Replacement Policy for Improving GPU Cache Performance. IT Convergence and Security 2017, Vol. 2, 2018, 450: 127-133.
  • [24] Zhao, Xuemei; Sammut, Karl; He, Fangpo. Performance Evaluation of a Novel CMP Cache Structure for Hybrid Workloads. Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies, Proceedings, 2007: 89-96.
  • [25] Maynard, A. M. G.; Donnelly, C. M.; Olszewski, B. R. Contrasting Characteristics and Cache Performance of Technical and Multiuser Commercial Workloads. SIGPLAN Notices, 1994, 29(11): 145-156.
  • [26] Lal, Sohan; Varma, Bogaraju Sharatchandra; Juurlink, Ben. A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads. International Journal of Parallel Programming, 2022, 50(2): 189-216.
  • [28] Lu, Gangzhao; Zhang, Weizhe; Wang, Zheng. Optimizing GPU Memory Transactions for Convolution Operations. 2020 IEEE International Conference on Cluster Computing (CLUSTER 2020), 2020: 399-403.
  • [29] Boukhobza, Jalil; Olivier, Pierre. C-Lash: A Cache System for Optimizing NAND Flash Memory Performance and Lifetime. Digital Information and Communication Technology and Its Applications, Pt II, 2011, 167(2): 599+.
  • [30] Arif, Moiz; Maurya, Avinash; Rafique, M. Mustafa. Accelerating Performance of GPU-based Workloads Using CXL. Proceedings of the 13th Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures (FlexScience 2023), 2023: 27-31.