Locality-aware CTA clustering for modern GPUs

Cited by: 34
Authors
Li A. [1 ]
Song S.L. [1 ]
Liu W. [2 ]
Liu X. [3 ]
Kumar A. [4 ]
Corporaal H. [5 ]
Affiliations
[1] Pacific Northwest National Lab, Richland, WA
[2] University of Copenhagen, Copenhagen
[3] College of William and Mary, Williamsburg, VA
[4] Technische Universität Dresden, Dresden
[5] Eindhoven University of Technology, Eindhoven
Source
Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, United States, Vol. 52
Keywords
Cache locality; CTA; GPU; Performance optimization; Runtime tool
DOI
10.1145/3037697.3037709
Abstract
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly shared L2 with its long access latency, while in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through a further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling so as to group CTAs with potential reuse together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance by reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, leading to average speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse. © 2017 ACM.
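The abstract's central idea, reshaping the default CTA dispatch so that CTAs with overlapping working sets land on the same SM, can be illustrated with a small host-side simulation. This is only a sketch of the scheduling concept, not the paper's actual implementation; the SM count, grid size, cluster size, and the neighbor-pair locality proxy are all made-up illustrative parameters:

```python
# Illustrative sketch (not the paper's technique as implemented): remap a 1-D
# grid of CTAs so that each contiguous "cluster" of CTAs -- which in many
# stencil or tiled kernels reuse neighboring data -- is bound to one SM,
# instead of being striped across SMs by a round-robin-like default scheduler.

def default_schedule(num_ctas, num_sms):
    """Model a round-robin default dispatch: CTA i goes to SM (i mod num_sms)."""
    return {i: i % num_sms for i in range(num_ctas)}

def clustered_schedule(num_ctas, num_sms, cluster_size):
    """Group every `cluster_size` consecutive CTAs onto the same SM so their
    overlapping working sets can hit in that SM's L1/Tex cache."""
    return {i: (i // cluster_size) % num_sms for i in range(num_ctas)}

def colocated_neighbor_pairs(schedule, num_ctas):
    """Count neighboring CTA pairs (i, i+1) placed on the same SM -- a crude
    proxy for inter-CTA locality that an in-core cache could harvest."""
    return sum(1 for i in range(num_ctas - 1) if schedule[i] == schedule[i + 1])

if __name__ == "__main__":
    NUM_CTAS, NUM_SMS, CLUSTER = 64, 16, 4   # made-up example sizes
    base = colocated_neighbor_pairs(default_schedule(NUM_CTAS, NUM_SMS), NUM_CTAS)
    clus = colocated_neighbor_pairs(clustered_schedule(NUM_CTAS, NUM_SMS, CLUSTER), NUM_CTAS)
    print(base, clus)   # -> 0 48: clustering keeps neighbor pairs co-resident
```

In the round-robin model, no two consecutive CTAs ever share an SM, so their data reuse can only be caught by the shared L2; the clustered mapping keeps three of every four neighbor pairs on one SM, where an L1 or L1/Tex cache could serve the reuse. On real hardware the paper achieves this remapping purely in software, since the hardware CTA scheduler is not programmable.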
Pages: 297-311
Page count: 14
Related papers
50 records in total
  • [1] Locality-Aware CTA Clustering for Modern GPUs
    Li, Ang
    Song, Shuaiwen Leon
    Liu, Weifeng
    Liu, Xu
    Kumar, Akash
    Corporaal, Henk
    [J]. OPERATING SYSTEMS REVIEW, 2017, 51 (02) : 297 - 311
  • [2] Locality-Aware CTA Clustering for Modern GPUs
    Li, Ang
    Song, Shuaiwen Leon
    Liu, Weifeng
    Liu, Xu
    Kumar, Akash
    Corporaal, Henk
    [J]. TWENTY-SECOND INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXII), 2017, : 297 - 311
  • [3] Locality-Aware CTA Clustering for Modern GPUs
    Li, Ang
    Song, Shuaiwen Leon
    Liu, Weifeng
    Liu, Xu
    Kumar, Akash
    Corporaal, Henk
    [J]. ACM SIGPLAN NOTICES, 2017, 52 (04) : 297 - 311
  • [4] Locality-Aware CTA Scheduling for Gaming Applications
    Ukarande, Aditya
    Patidar, Suryakant
    Rangan, Ram
    [J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2022, 19 (01)
  • [5] A Study of the Potential of Locality-Aware Thread Scheduling for GPUs
    Nugteren, Cedric
    van den Braak, Gert-Jan
    Corporaal, Henk
    [J]. EURO-PAR 2014: PARALLEL PROCESSING WORKSHOPS, PT II, 2014, 8806 : 146 - 157
  • [6] Locality-Aware Mapping of Nested Parallel Patterns on GPUs
    Lee, HyoukJoong
    Brown, Kevin J.
    Sujeeth, Arvind K.
    Rompf, Tiark
    Olukotun, Kunle
    [J]. 2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO), 2014, : 63 - 74
  • [7] Locality-Aware Task-Parallel Execution on GPUs
    Hbeika, Jad
    Kulkarni, Milind
    [J]. LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2016, 2017, 10136 : 250 - 264
  • [8] Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs
    Chen, Yanhao
    Hayes, Ari B.
    Zhang, Chi
    Salmon, Timothy
    Zhang, Eddy Z.
    [J]. PROCEEDINGS OF THE 2018 USENIX ANNUAL TECHNICAL CONFERENCE, 2018, : 413 - 425
  • [9] LAS: Locality-Aware Scheduling for GEMM-Accelerated Convolutions in GPUs
    Kim, Hyeonjin
    Song, William J.
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (05) : 1479 - 1494
  • [10] Locality-Aware Crowd Counting
    Zhou, Joey Tianyi
    Le Zhang
    Du Jiawei
    Xi Peng
    Fang, Zhiwen
    Zhe Xiao
    Zhu, Hongyuan
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (07) : 3602 - 3613