Scalar Waving: Improving the Efficiency of SIMD Execution on GPUs

Cited by: 8
Authors
Yilmazer, Ayse [1]
Chen, Zhongliang [1]
Kaeli, David [1]
Affiliations
[1] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA 02115 USA
Keywords
GPU; SIMD Efficiency; Redundant Computation; Scalar Waving
DOI
10.1109/IPDPS.2014.22
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
GPUs take advantage of uniformity in program control flow and utilize SIMD execution to obtain execution efficiency. In SIMD execution, threads are batched into SIMD groups to share a common program counter and execute identical instructions on SIMD pipelines. Previous research [1] has shown that there are a significant number of scalar instructions - instructions where different threads in a SIMD group execute using the same input operands and generate the exact same output - present in a range of applications. GPUs eliminate redundant fetches and decodes by utilizing a shared common pipeline front-end. However, most GPUs do not handle scalar instructions efficiently, allowing these instructions to be redundantly executed by the threads in a SIMD group. In this paper, we propose to use scalar execution to eliminate redundant execution of scalar instructions. We introduce scalar waving as a mechanism to batch scalar operations possessing the same PC and execute them as a group on SIMD lanes for efficiency. We also propose simultaneous execution of dynamically-formed scalar waves with SIMD groups to overcome the under-utilization of SIMD lanes when encountering divergence. We evaluate our work using 22 different GPU benchmarks taken from 4 different benchmark suites. We evaluate a range of configurations using timing simulation. Our results show that scalar waving can obtain up to a 25% improvement in performance on average. Our experiments also provide insight into the amount of performance gain that we can expect with scalar waving as a function of the scalar content, occupancy, and memory characteristics of the target application.
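The two ideas in the abstract, detecting a scalar instruction and batching scalar operations that share a PC, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism the paper describes: the function names `is_scalar` and `form_scalar_waves`, the tuple layout of `pending`, and the `simd_width` parameter are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def is_scalar(operands_per_lane, active_mask):
    """Return True if every active lane reads identical input operands.

    operands_per_lane: one tuple of input operands per SIMD lane.
    active_mask: one bool per lane; inactive (diverged) lanes are ignored.
    """
    active = [ops for ops, on in zip(operands_per_lane, active_mask) if on]
    return len(active) > 0 and all(ops == active[0] for ops in active)


def form_scalar_waves(pending, simd_width):
    """Batch scalar operations that share a PC into waves.

    pending: list of (group_id, pc, operands), one entry per scalar
             instruction issued by a SIMD group (hypothetical layout).
    Returns a list of (pc, [(group_id, operands), ...]) where each wave
    holds at most simd_width operations, one per lane.
    """
    by_pc = defaultdict(list)
    for group_id, pc, operands in pending:
        by_pc[pc].append((group_id, operands))

    waves = []
    for pc, ops in by_pc.items():
        # Split the operations for this PC into lane-sized chunks.
        for start in range(0, len(ops), simd_width):
            waves.append((pc, ops[start:start + simd_width]))
    return waves


# A group whose lanes all read (3, 4) executes a scalar instruction.
uniform_group = [(3, 4)] * 4
mixed_group = [(3, 4), (3, 5), (3, 4), (3, 4)]
all_active = [True] * 4

# Three SIMD groups hit the same PC with scalar operands, one hits another PC.
pending = [
    (0, 0x10, (3, 4)),
    (1, 0x10, (7, 8)),
    (2, 0x20, (1, 1)),
    (3, 0x10, (5, 6)),
]
waves = form_scalar_waves(pending, simd_width=2)
```

In this model a scalar operation occupies a single lane of a wave rather than every lane of its own SIMD group, which is the redundancy the abstract says scalar waving is meant to remove. Masking out a diverged lane can also turn a non-uniform instruction into a scalar one, as `is_scalar(mixed_group, [True, False, True, True])` shows.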
Pages: 10