Scalar Waving: Improving the Efficiency of SIMD Execution on GPUs

Cited by: 8
Authors
Yilmazer, Ayse [1]
Chen, Zhongliang [1]
Kaeli, David [1]
Affiliations
[1] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA 02115 USA
Keywords
GPU; SIMD Efficiency; Redundant Computation; Scalar Waving
DOI
10.1109/IPDPS.2014.22
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
GPUs exploit uniformity in program control flow and use SIMD execution to obtain execution efficiency. In SIMD execution, threads are batched into SIMD groups that share a common program counter and execute identical instructions on SIMD pipelines. Previous research [1] has shown that a significant number of scalar instructions (instructions for which the threads in a SIMD group use the same input operands and generate exactly the same output) are present in a range of applications. GPUs eliminate redundant fetches and decodes by using a shared pipeline front-end. However, most GPUs do not handle scalar instructions efficiently, allowing these instructions to be redundantly executed by the threads in a SIMD group. In this paper, we propose to use scalar execution to eliminate redundant execution of scalar instructions. We introduce scalar waving as a mechanism to batch scalar operations that have the same PC and execute them as a group on SIMD lanes for efficiency. We also propose simultaneous execution of dynamically formed scalar waves with SIMD groups to overcome the under-utilization of SIMD lanes when encountering divergence. We evaluate our work using 22 GPU benchmarks taken from 4 benchmark suites, across a range of configurations, using timing simulation. Our results show that scalar waving can obtain up to a 25% improvement in performance on average. Our experiments also provide insight into the performance gain that can be expected from scalar waving as a function of the scalar content, occupancy, and memory characteristics of the target application.
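The two ideas in the abstract, detecting a scalar instruction and batching scalar operations that share a PC, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism from the paper: the function names `is_scalar` and `form_scalar_waves`, the tuple encoding of operands, and the `(pc, group_id)` representation are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def is_scalar(operands, active_mask):
    """Return True if every active thread in a SIMD group reads the same operands.

    operands: one tuple of input operand values per thread (hypothetical encoding).
    active_mask: one bool per thread; inactive (diverged) threads are ignored.
    """
    active = [ops for ops, on in zip(operands, active_mask) if on]
    return len(active) > 0 and all(ops == active[0] for ops in active)


def form_scalar_waves(scalar_ops, simd_width):
    """Batch scalar operations with the same PC into waves of at most simd_width.

    scalar_ops: list of (pc, group_id) pairs, one per SIMD group whose current
    instruction was found to be scalar. Each wave occupies one lane per group.
    """
    by_pc = defaultdict(list)
    for pc, group_id in scalar_ops:
        by_pc[pc].append(group_id)

    waves = []
    for pc in sorted(by_pc):
        ids = by_pc[pc]
        # Split the groups waiting at this PC into chunks that fit the SIMD width.
        for i in range(0, len(ids), simd_width):
            waves.append((pc, ids[i:i + simd_width]))
    return waves
```

In this model, a scalar instruction from each SIMD group needs only one lane, so several groups waiting at the same PC can share a single pass through the SIMD pipeline instead of each group executing the same computation redundantly on every lane.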
Pages: 10
Related papers
50 in total
  • [41] C-for-Metal: High Performance SIMD Programming on Intel GPUs
    Lueh, Guei-Yuan
    Chen, Kaiyu
    Chen, Gang
    Fuentes, Joel
    Chen, Wei-Yu
    Fu, Fangwen
    Jiang, Hong
    Li, Hongzheng
    Rhee, Daniel
    CGO '21: PROCEEDINGS OF THE 2021 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO), 2021, : 289 - 300
  • [42] Visibility Rendering Order: Improving Energy Efficiency on Mobile GPUs through Frame Coherence
    de Lucas, Enrique
    Marcuello, Pedro
    Parcerisa, Joan-Manuel
    Gonzalez, Antonio
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (02) : 473 - 485
  • [43] Building a Lightweight Trusted Execution Environment for Arm GPUs
    Wang, Chenxu
    Deng, Yunjie
    Ning, Zhenyu
    Leach, Kevin
    Li, Jin
    Yan, Shoumeng
    He, Zhengyu
    Cao, Jiannong
    Zhang, Fengwei
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2024, 21 (04) : 3801 - 3816
  • [44] Improving CADNA performance on GPUs
    Eberhart, P.
    Landreau, B.
    Brajard, J.
    Fortin, P.
    Jezequel, F.
    2018 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW 2018), 2018, : 1016 - 1025
  • [45] Scalar Processing Overhead on SIMD-Only Architectures
    Azevedo, Arnaldo
    Juurlink, Ben
    2009 20TH IEEE INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS, 2009, : 183 - 190
  • [46] Sharing SIMD execution units with decoupled offloader in asymmetric multicores
    Vieira, Caio
    Beck, Antonio Carlos Schneider
    ANALOG INTEGRATED CIRCUITS AND SIGNAL PROCESSING, 2022, 112 : 263 - 275
  • [47] Warp-Consolidation: A Novel Execution Model for GPUs
    Li, Ang
    Liu, Weifeng
    Wang, Linnan
    Barker, Kevin
    Song, Shuaiwen Leon
    INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS 2018), 2018, : 53 - 64
  • [48] MIMD Programs Execution Support on SIMD Machines: A Holistic Survey
    Mustafa, Dheya
    Alkhasawneh, Ruba
    Obeidat, Fadi
    Shatnawi, Ahmed S.
    IEEE ACCESS, 2024, 12 : 34354 - 34377
  • [49] Improving SIMD Code Generation in QEMU
    Fu, Sheng-Yu
    Wu, Jan-Jan
    Hsu, Wei-Chung
    2015 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), 2015, : 1233 - 1236
  • [50] Sharing SIMD execution units with decoupled offloader in asymmetric multicores
    Vieira, Caio
    Beck, Antonio Carlos Schneider
    ANALOG INTEGRATED CIRCUITS AND SIGNAL PROCESSING, 2022, 112 (02) : 263 - 275