Scalar Waving: Improving the Efficiency of SIMD Execution on GPUs

Cited by: 8

Authors:

Yilmazer, Ayse [1]

Chen, Zhongliang [1]

Kaeli, David [1]

Affiliations:

[1] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA 02115 USA

Source:

2014 IEEE 28TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM | 2014

Keywords:

GPU; SIMD Efficiency; Redundant Computation; Scalar Waving;

DOI:

10.1109/IPDPS.2014.22

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

GPUs take advantage of uniformity in program control flow and utilize SIMD execution to obtain execution efficiency. In SIMD execution, threads are batched into SIMD groups to share a common program counter and execute identical instructions on SIMD pipelines. Previous research [1] has shown that a significant number of scalar instructions (instructions where different threads in a SIMD group execute using the same input operands and generate the exact same output) are present in a range of applications. GPUs eliminate redundant fetches and decodes by utilizing a shared common pipeline front-end. However, most GPUs do not handle scalar instructions efficiently, allowing these instructions to be redundantly executed by the threads in a SIMD group. In this paper, we propose to use scalar execution to eliminate redundant execution of scalar instructions. We introduce scalar waving as a mechanism to batch scalar operations possessing the same PC and execute them as a group on SIMD lanes for efficiency. We also propose simultaneous execution of dynamically-formed scalar waves with SIMD groups to overcome the under-utilization of SIMD lanes when encountering divergence. We evaluate our work using 22 different GPU benchmarks taken from 4 different benchmark suites. We evaluate a range of configurations using timing simulation. Our results show that scalar waving can obtain up to a 25% improvement in performance on average. Our experiments also provide insight into the amount of performance gain that we can expect with scalar waving as a function of the scalar content, occupancy, and memory characteristics of the target application.
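
The two ideas in the abstract, detecting a scalar instruction and batching scalar operations that share a PC, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism evaluated in the paper: the names `is_scalar`, `form_scalar_waves`, and `SIMD_WIDTH`, the tuple layout of `pending`, and the 4-lane width are all hypothetical and chosen for readability.

```python
from collections import defaultdict

SIMD_WIDTH = 4  # hypothetical lane count; real SIMD groups are wider


def is_scalar(operands_per_thread):
    """Return True if every thread in a SIMD group has identical input operands.

    Such an instruction produces the same result in every lane, so executing
    it once per lane is redundant.
    """
    first = operands_per_thread[0]
    return all(ops == first for ops in operands_per_thread)


def form_scalar_waves(pending, width=SIMD_WIDTH):
    """Batch scalar operations that share a PC into waves of at most `width`.

    `pending` is a list of (pc, group_id, operands) tuples, one entry per SIMD
    group whose current instruction was found to be scalar (assumed layout).
    Each returned wave is (pc, [(group_id, operands), ...]); every entry in a
    wave would occupy one SIMD lane.
    """
    by_pc = defaultdict(list)
    for pc, group_id, operands in pending:
        by_pc[pc].append((group_id, operands))

    waves = []
    for pc, ops in by_pc.items():
        # Split the operations for this PC into lane-sized chunks.
        for start in range(0, len(ops), width):
            waves.append((pc, ops[start:start + width]))
    return waves
```

In this model, five SIMD groups that each hold a scalar instruction at the same PC would be packed into one full wave of four lanes plus one wave with a single occupied lane, instead of occupying five full SIMD issues.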

Pages: 10

50 items in total

[31] SIMD-X: Programming and Processing of Graph Algorithms on GPUs
Liu, Hang
Huang, H. Howie
PROCEEDINGS OF THE 2019 USENIX ANNUAL TECHNICAL CONFERENCE, 2019, : 411 - 427
[32] Collaborative design: Improving efficiency by concurrent execution of Boolean tasks
Zheng, Yang
Shen, Haifeng
Sun, Chengzheng
EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (02) : 1089 - 1098
[33] iGPU: Exception Support and Speculative Execution on GPUs
Menon, Jaikrishnan
de Kruijf, Marc
Sankaralingam, Karthikeyan
2012 39TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA), 2012, : 72 - 83
[34] Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs
Du, Jiangsu
Jiang, Jiazhi
Zheng, Jiang
Zhang, Hongbin
Huang, Dan
Lu, Yutong
ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2023, 20 (04)
[35] Improving First Level Cache Efficiency for GPUs Using Dynamic Line Protection
Zhu, Xian
Wernsman, Robert
Zambreno, Joseph
PROCEEDINGS OF THE 47TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, 2018,
[36] Software Pipelined Execution of Stream Programs on GPUs
Udupa, Abhishek
Govindarajan, R.
Thazhuthaveetil, Matthew J.
CGO 2009: INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, PROCEEDINGS, 2009, : 200 - 209
[37] Performance and Power Prediction for Concurrent Execution on GPUs
Moolchandani, Diksha
Kumar, Anshul
Sarangi, Smruti R.
ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2022, 19 (03)
[38] SUITABILITY OF GCM PHYSICS FOR EXECUTION ON SIMD PARALLEL COMPUTERS
ROTSTAYN, L
FRANCIS, R
ABRAMSON, D
DIX, M
JOURNAL OF THE METEOROLOGICAL SOCIETY OF JAPAN, 1993, 71 (02) : 297 - 303
[39] Efficient Execution of Graph Algorithms on CPU with SIMD Extensions
Zheng, Ruohuang
Pai, Sreepathi
CGO '21: PROCEEDINGS OF THE 2021 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO), 2021, : 262 - 276
[40] A Multiple SIMD, Multiple Data (MSMD) Architecture: Parallel Execution of Dynamic and Static SIMD Fragments
Wang, Yaohua
Chen, Shuming
Wan, Jianghua
Meng, Jiayuan
Zhang, Kai
Liu, Wei
Ning, Xi
19TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA2013), 2013, : 603 - 614
