Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

Cited by: 17
Authors
Mu, Shuai [1 ]
Deng, Yandong [1 ]
Chen, Yubei [1 ]
Li, Huaiming [1 ]
Pan, Jianming [1 ]
Zhang, Wenjun [1 ]
Wang, Zhihua [1 ]
Institutions
[1] Inst Microelect Circuit & Syst, Beijing 100015, Peoples R China
Keywords
Cache management; general purpose computing on graphics processing units (GPGPU); memory latency divergence; memory scheduling; priority; warp; OPTIMIZATION; PERFORMANCE;
DOI
10.1109/TVLSI.2013.2278025
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Subject Classification Code
0812;
Abstract
Modern graphics processing units (GPUs) deliver tremendous computing horsepower by running tens of thousands of threads concurrently. This massively parallel execution model has been effective at hiding the long latency of off-chip memory accesses in graphics and other general computing applications that exhibit regular memory behaviors. With the fast-growing demand for general purpose computing on GPUs (GPGPU), GPU workloads are becoming highly diversified and thus require a synergistic coordination of both computing and memory resources to unleash the computing power of GPUs. Accordingly, recent graphics processors have begun to integrate an on-die level-2 (L2) cache. The huge number of threads on GPUs, however, poses significant challenges to L2 cache design. Experiments on a variety of GPGPU applications reveal that the L2 cache may or may not improve the overall performance, depending on the characteristics of the application. In this paper, we propose efficient techniques to improve GPGPU performance by orchestrating both the L2 cache and memory in a unified framework. The basic philosophy is to exploit the temporal locality among the massive number of concurrent memory requests and to minimize the impact of memory divergence among simultaneously executed groups of threads. Our major contributions are twofold. First, a priority-based cache management scheme is proposed to maximize the chance that frequently revisited data are kept in the cache. Second, an effective memory scheduling scheme is introduced to reorder memory requests in the memory controller according to their divergence behavior, reducing the average waiting time of warps. Simulation results reveal that our techniques enhance the overall performance by 10% on average for memory-intensive benchmarks, with a maximum gain of up to 30%.
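The two ideas in the abstract can be illustrated with a simplified, hypothetical sketch (this is not the paper's implementation; the class and function names and the policies shown are invented for illustration): a cache that evicts the line with the fewest revisits, so frequently reused data survives streaming traffic, and a memory scheduler that serves the warp with the fewest outstanding requests first, so that warp becomes ready soonest and the average warp waiting time shrinks.

```python
class PriorityCache:
    """Toy priority-based cache: the eviction victim is the line with
    the lowest revisit count, so hot lines tend to stay resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = {}  # tag -> revisit count (acts as priority)

    def access(self, tag):
        """Return True on a hit, False on a miss (with fill/eviction)."""
        if tag in self.hits:
            self.hits[tag] += 1       # boost priority on reuse
            return True
        if len(self.hits) >= self.capacity:
            victim = min(self.hits, key=self.hits.get)
            del self.hits[victim]     # evict the least-revisited line
        self.hits[tag] = 1
        return False


def schedule(pending):
    """Toy divergence-aware scheduler: `pending` maps warp_id to its
    list of outstanding request addresses.  Serving the warp with the
    fewest outstanding requests first lets it resume soonest, which
    lowers the average waiting time across warps."""
    order = []
    for warp in sorted(pending, key=lambda w: len(pending[w])):
        order.extend(pending[warp])
    return order
```

For example, with a two-line cache, two accesses to `a`, then one each to `b` and `c`, line `b` (one revisit) is evicted rather than `a` (two revisits); and `schedule({0: [10, 11, 12], 1: [20]})` serves warp 1's single request before warp 0's three.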
Pages: 1803-1814 (12 pages)
Related Papers
50 records in total
  • [1] Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance
    Candel, Francisco
    Valero, Alejandro
    Petit, Salvador
    Sahuquillo, Julio
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2019, 68 (10) : 1442 - 1454
  • [2] BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads
    Liu, Yuxi
    Zhao, Xia
    Yu, Zhibin
    Wang, Zhenlin
    Wang, Xiaolin
    Luo, Yingwei
    Eeckhout, Lieven
    [J]. 2017 IEEE 35TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2017, : 633 - 640
  • [3] Enhancing Data Reuse in Cache Contention Aware Thread Scheduling on GPGPU
    Lu, Chin-Fu
    Kuo, Hsien-Kai
    Lai, Bo-Cheng Charles
    [J]. PROCEEDINGS OF 2016 10TH INTERNATIONAL CONFERENCE ON COMPLEX, INTELLIGENT, AND SOFTWARE INTENSIVE SYSTEMS (CISIS), 2016, : 351 - 356
  • [4] POSTER: BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads
    Liu, Yuxi
    Zhao, Xia
    Yu, Zhibin
    Wang, Zhenlin
    Wang, Xiaolin
    Luo, Yingwei
    Eeckhout, Lieven
    [J]. 2017 26TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT), 2017, : 140 - 141
  • [5] Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling
    Chen, Li-Jhan
    Cheng, Hsiang-Yun
    Wang, Po-Han
    Yang, Chia-Lin
    [J]. IEEE COMPUTER ARCHITECTURE LETTERS, 2017, 16 (02) : 127 - 131
  • [6] The Impact of Cache and Dynamic Memory Management in Static Dataflow Applications
    Ghasemi, Alemeh
    Ruaro, Marcelo
    Cataldo, Rodrigo
    Diguet, Jean-Philippe
    Martin, Kevin J. M.
    [J]. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2022, 94 (07) : 721 - 738
  • [7] AdaptSize: Orchestrating the Hot Object Memory Cache in a Content Delivery Network
    Berger, Daniel S.
    Sitaraman, Ramesh K.
    Harchol-Balter, Mor
    [J]. PROCEEDINGS OF NSDI '17: 14TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, 2017, : 483 - 498
  • [8] Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory
    Wang, Guan
    Zang, Chuanqi
    Ju, Lei
    Zhao, Mengying
    Cai, Xiaojun
    Jia, Zhiping
    [J]. ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, 2018, 17 (04)
  • [9] A GPGPU Compiler for Memory Optimization and Parallelism Management
    Yang, Yi
    Xiang, Ping
    Kong, Jingfei
    Zhou, Huiyang
    [J]. ACM SIGPLAN NOTICES, 2010, 45 (06) : 86 - 97