Scalar Waving: Improving the Efficiency of SIMD Execution on GPUs

Cited by: 8

Authors:

Yilmazer, Ayse [1]

Chen, Zhongliang [1]

Kaeli, David [1]

Affiliation:

[1] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA 02115 USA

Source:

2014 IEEE 28TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM | 2014

Keywords:

GPU; SIMD Efficiency; Redundant Computation; Scalar Waving;

DOI:

10.1109/IPDPS.2014.22

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

GPUs take advantage of uniformity in program control flow and utilize SIMD execution to obtain execution efficiency. In SIMD execution, threads are batched into SIMD groups to share a common program counter and execute identical instructions on SIMD pipelines. Previous research [1] has shown that there are a significant number of scalar instructions - instructions where different threads in a SIMD group execute using the same input operands and generate the exact same output - present in a range of applications. GPUs eliminate redundant fetches and decodes by utilizing a shared common pipeline front-end. However, most GPUs do not handle scalar instructions efficiently, allowing these instructions to be redundantly executed by the threads in a SIMD group. In this paper, we propose to use scalar execution to eliminate redundant execution of scalar instructions. We introduce scalar waving as a mechanism to batch scalar operations possessing the same PC and execute them as a group on SIMD lanes for efficiency. We also propose simultaneous execution of dynamically-formed scalar waves with SIMD groups to overcome the under-utilization of SIMD lanes when encountering divergence. We evaluate our work using 22 different GPU benchmarks taken from 4 different benchmark suites. We evaluate a range of configurations using timing simulation. Our results show that scalar waving can obtain up to a 25% improvement in performance on average. Our experiments also provide insight into the amount of performance gain that we can expect with scalar waving as a function of the scalar content, occupancy, and memory characteristics of the target application.
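As a rough illustration of the two ideas named in the abstract, the sketch below models them in software: an instruction is treated as scalar when every thread in a SIMD group supplies the same input operands, and scalar operations that share a PC are batched into one wave. This is a toy model under stated assumptions, not the hardware mechanism from the paper. The names `is_scalar` and `form_scalar_waves`, the `(pc, group_id)` tuples, and the `simd_width` parameter are all hypothetical.

```python
from collections import defaultdict


def is_scalar(operands_per_thread):
    """Return True when all threads in a SIMD group supply identical operands.

    operands_per_thread: non-empty list with one operand tuple per active thread.
    """
    first = operands_per_thread[0]
    return all(ops == first for ops in operands_per_thread)


def form_scalar_waves(scalar_ops, simd_width):
    """Batch scalar operations that share a PC into waves.

    scalar_ops: list of (pc, group_id) pairs, one per scalar instruction
                issued by a SIMD group (hypothetical representation).
    simd_width: assumed number of SIMD lanes; each wave holds at most this
                many scalar operations, one per lane.
    """
    by_pc = defaultdict(list)
    for pc, group_id in scalar_ops:
        by_pc[pc].append(group_id)

    waves = []
    for pc, groups in by_pc.items():
        # Split the groups waiting at this PC into lane-sized chunks.
        for start in range(0, len(groups), simd_width):
            waves.append((pc, groups[start:start + simd_width]))
    return waves


# Example: four SIMD groups issue scalar instructions; three share PC 0x10.
pending = [(0x10, 0), (0x10, 1), (0x20, 2), (0x10, 3)]
waves = form_scalar_waves(pending, simd_width=2)
```

In this model each wave occupies one lane per contributing SIMD group, so a scalar instruction uses one lane instead of a full group's worth of lanes. How waves are actually formed and scheduled in the proposed hardware is not described in this record.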


Pages: 10
