Locality Protected Dynamic Cache Allocation Scheme on GPUs

Cited: 0
Authors
Zhang, Yang [1 ]
Xing, Zuocheng [1 ]
Zhou, Li [2 ]
Zhu, Chunsheng [3 ]
Affiliations
[1] Natl Univ Def Technol, Natl Lab Parallel & Distributed Proc, Changsha, Hunan, Peoples R China
[2] Natl Univ Def Technol, Sch Elect Sci & Engn, Changsha, Hunan, Peoples R China
[3] Univ British Columbia, Dept Elect & Comp Engn, Vancouver, BC, Canada
Keywords
PARALLELISM;
DOI
10.1109/TrustCom.2016.235
CLC number
TP [Automation technology, computer technology];
Discipline code
0812
Abstract
As we approach the exascale era in supercomputing, designing a balanced system that combines powerful computing capability with low energy consumption becomes increasingly important. The GPU is a widely used accelerator in most recently deployed supercomputers; it relies on massive multithreading to hide long latencies and achieves high energy efficiency. In contrast to their strong computing power, GPUs have few on-chip resources, with several MB of fast on-chip memory storage per SM (Streaming Multiprocessor). GPU caches exhibit poor efficiency due to the mismatch between the throughput-oriented execution model and the cache hierarchy design. Because of this severe shortage of on-chip memory, the benefit of the GPU's high computing capacity is dragged down dramatically by poor cache performance, which limits system performance and energy efficiency. In this paper, we put forward a locality-protected scheme that makes full use of data locality within the fixed cache capacity. We present a Locality Protected method based on instruction PC (LPP) to improve GPU performance. First, we use a PC-based collector to gather the reuse information of each cache line. After obtaining this dynamic reuse information, an intelligent cache allocation unit (ICAU) coordinates it with the LRU (Least Recently Used) replacement policy to identify the cache line with the least locality for eviction. The results show that LPP provides up to a 17.8% speedup and an average improvement of 5.5% over the baseline.
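The abstract describes LPP only at a high level. The C++ sketch below illustrates one plausible reading of it under assumptions not stated in the record: a single 4-way cache set, a 256-entry PC-indexed reuse table with saturating counters, and illustrative names (ReuseCollector, Icau, PickVictim). It is a minimal sketch, not the authors' implementation.

// Minimal sketch of the LPP idea described in the abstract, assuming a single
// 4-way set-associative cache set, a 256-entry PC-indexed reuse table, and
// saturating counters. ReuseCollector, Icau, and all sizes below are
// illustrative assumptions, not details taken from the paper.
#include <array>
#include <cstdint>
#include <iostream>

constexpr int kWays = 4;             // associativity of one cache set (assumed)
constexpr int kPcTableEntries = 256; // PC-indexed reuse table size (assumed)

// PC-based collector: tracks how often lines fetched by a given load PC are
// re-referenced before they are evicted.
struct ReuseCollector {
    std::array<uint8_t, kPcTableEntries> reuse{};  // saturating counters

    static size_t Index(uint64_t pc) { return (pc >> 3) % kPcTableEntries; }

    void OnHit(uint64_t pc)         { auto& c = reuse[Index(pc)]; if (c < 255) ++c; }
    void OnEvictUnused(uint64_t pc) { auto& c = reuse[Index(pc)]; if (c > 0)   --c; }
    uint8_t Predict(uint64_t pc) const { return reuse[Index(pc)]; }
};

struct Line {
    bool     valid = false;
    uint64_t tag = 0;
    uint64_t fill_pc = 0;   // PC of the load that filled this line
    uint64_t last_use = 0;  // timestamp for LRU ordering
    bool     reused = false;
};

// Intelligent cache allocation unit (ICAU): combines the collector's per-PC
// reuse prediction with LRU order, evicting the line with the least locality
// and breaking ties in favour of the least recently used line.
struct Icau {
    int PickVictim(const std::array<Line, kWays>& set, const ReuseCollector& rc) const {
        int victim = 0;
        for (int w = 0; w < kWays; ++w) {
            if (!set[w].valid) return w;  // free way: no eviction needed
            uint8_t pw = rc.Predict(set[w].fill_pc);
            uint8_t pv = rc.Predict(set[victim].fill_pc);
            if (pw < pv || (pw == pv && set[w].last_use < set[victim].last_use))
                victim = w;
        }
        return victim;
    }
};

int main() {
    ReuseCollector rc;
    Icau icau;
    std::array<Line, kWays> set{};
    uint64_t clock = 0;

    auto access = [&](uint64_t tag, uint64_t pc) {
        ++clock;
        for (auto& l : set) {
            if (l.valid && l.tag == tag) {            // hit: reward the fill PC
                l.last_use = clock;
                l.reused = true;
                rc.OnHit(l.fill_pc);
                return;
            }
        }
        int v = icau.PickVictim(set, rc);             // miss: allocate a way
        if (set[v].valid && !set[v].reused) rc.OnEvictUnused(set[v].fill_pc);
        set[v] = Line{true, tag, pc, clock, false};
    };

    // Toy stream: PC 0x100 streams with no reuse, PC 0x200 reuses two lines.
    for (uint64_t i = 0; i < 32; ++i) access(1000 + i, 0x100);
    for (uint64_t i = 0; i < 32; ++i) access(i % 2, 0x200);

    std::cout << "predicted reuse, streaming PC: " << int(rc.Predict(0x100)) << "\n"
              << "predicted reuse, reused PC:    " << int(rc.Predict(0x200)) << "\n";
}

In this toy access stream, the streaming PC keeps a low reuse counter while the counter for the reuse-heavy PC grows, so the ICAU preferentially evicts lines brought in by the streaming PC; this is the protection effect the abstract describes.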
Pages: 1524-1530
Page count: 7