Locality-aware CTA clustering for modern GPUs

Cited by: 34
Authors
Li A. [1 ]
Song S.L. [1 ]
Liu W. [2 ]
Liu X. [3 ]
Kumar A. [4 ]
Corporaal H. [5 ]
Affiliations
[1] Pacific Northwest National Lab, Richland, WA
[2] University of Copenhagen, Copenhagen
[3] College of William and Mary, Williamsburg, VA
[4] Technische Universität Dresden, Dresden
[5] Eindhoven University of Technology, Eindhoven
Source
Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, United States, Vol. 52
Keywords
Cache locality; CTA; GPU; Performance optimization; Runtime tool
DOI
10.1145/3037697.3037709
Abstract
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly shared L2 with its long access latency, while in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through a further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling so as to group CTAs with potential reuse together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance by reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, leading to average speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse. © 2017 ACM.
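The abstract's central idea, reshaping the default CTA dispatch so that CTAs with overlapping working sets land on the same SM, can be illustrated with a small host-side simulation. This is only a sketch of the scheduling concept, not the paper's actual implementation; the SM count, grid size, cluster size, and the neighbor-pair locality proxy are all made-up illustrative parameters:

```python
# Illustrative sketch (not the paper's technique as implemented): remap a 1-D
# grid of CTAs so that each contiguous "cluster" of CTAs -- which in many
# stencil or tiled kernels reuse neighboring data -- is bound to one SM,
# instead of being striped across SMs by a round-robin-like default scheduler.

def default_schedule(num_ctas, num_sms):
    """Model a round-robin default dispatch: CTA i goes to SM (i mod num_sms)."""
    return {i: i % num_sms for i in range(num_ctas)}

def clustered_schedule(num_ctas, num_sms, cluster_size):
    """Group every `cluster_size` consecutive CTAs onto the same SM so their
    overlapping working sets can hit in that SM's L1/Tex cache."""
    return {i: (i // cluster_size) % num_sms for i in range(num_ctas)}

def colocated_neighbor_pairs(schedule, num_ctas):
    """Count neighboring CTA pairs (i, i+1) placed on the same SM -- a crude
    proxy for inter-CTA locality that an in-core cache could harvest."""
    return sum(1 for i in range(num_ctas - 1) if schedule[i] == schedule[i + 1])

if __name__ == "__main__":
    NUM_CTAS, NUM_SMS, CLUSTER = 64, 16, 4   # made-up example sizes
    base = colocated_neighbor_pairs(default_schedule(NUM_CTAS, NUM_SMS), NUM_CTAS)
    clus = colocated_neighbor_pairs(clustered_schedule(NUM_CTAS, NUM_SMS, CLUSTER), NUM_CTAS)
    print(base, clus)   # -> 0 48: clustering keeps neighbor pairs co-resident
```

In the round-robin model, no two consecutive CTAs ever share an SM, so their data reuse can only be caught by the shared L2; the clustered mapping keeps three of every four neighbor pairs on one SM, where an L1 or L1/Tex cache could serve the reuse. On real hardware the paper achieves this remapping purely in software, since the hardware CTA scheduler is not programmable.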
Pages: 297-311
Page count: 14
Related papers
50 records in total
  • [1] Locality-Aware CTA Clustering for Modern GPUs
    Li, Ang
    Song, Shuaiwen Leon
    Liu, Weifeng
    Liu, Xu
    Kumar, Akash
    Corporaal, Henk
    [J]. OPERATING SYSTEMS REVIEW, 2017, 51 (02) : 297 - 311
  • [2] Locality-Aware CTA Clustering for Modern GPUs
    Li, Ang
    Song, Shuaiwen Leon
    Liu, Weifeng
    Liu, Xu
    Kumar, Akash
    Corporaal, Henk
    [J]. TWENTY-SECOND INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXII), 2017, : 297 - 311
  • [3] Locality-Aware CTA Clustering for Modern GPUs
    Li, Ang
    Song, Shuaiwen Leon
    Liu, Weifeng
    Liu, Xu
    Kumar, Akash
    Corporaal, Henk
    [J]. ACM SIGPLAN NOTICES, 2017, 52 (04) : 297 - 311
  • [4] Locality-Aware CTA Scheduling for Gaming Applications
    Ukarande, Aditya
    Patidar, Suryakant
    Rangan, Ram
    [J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2022, 19 (01)
  • [5] A Study of the Potential of Locality-Aware Thread Scheduling for GPUs
    Nugteren, Cedric
    van den Braak, Gert-Jan
    Corporaal, Henk
    [J]. EURO-PAR 2014: PARALLEL PROCESSING WORKSHOPS, PT II, 2014, 8806 : 146 - 157
  • [6] Locality-Aware Mapping of Nested Parallel Patterns on GPUs
    Lee, HyoukJoong
    Brown, Kevin J.
    Sujeeth, Arvind K.
    Rompf, Tiark
    Olukotun, Kunle
    [J]. 2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO), 2014, : 63 - 74
  • [7] Locality-Aware Task-Parallel Execution on GPUs
    Hbeika, Jad
    Kulkarni, Milind
    [J]. LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2016, 2017, 10136 : 250 - 264
  • [8] Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs
    Chen, Yanhao
    Hayes, Ari B.
    Zhang, Chi
    Salmon, Timothy
    Zhang, Eddy Z.
    [J]. PROCEEDINGS OF THE 2018 USENIX ANNUAL TECHNICAL CONFERENCE, 2018, : 413 - 425
  • [9] LAS: Locality-Aware Scheduling for GEMM-Accelerated Convolutions in GPUs
    Kim, Hyeonjin
    Song, William J.
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (05) : 1479 - 1494
  • [10] Locality-Aware Crowd Counting
    Zhou, Joey Tianyi
    Le Zhang
    Du Jiawei
    Xi Peng
    Fang, Zhiwen
    Zhe Xiao
    Zhu, Hongyuan
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (07) : 3602 - 3613