An enhanced GPU reduction at the warp-level

Cited by: 0
Authors
Hou Neng [1]
He Fazhi [1]
Zhou Yi [1]
Affiliations
[1] School of Computer Science and Technology, Wuhan University
Keywords
reduction; graphics processing unit; compute unified device architecture; warp-level reduction
DOI
10.19583/j.1003-4951.2016.02.007
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely applied to combinatorial optimization problems, which are NP-hard. These algorithms share a common operation, reduction, in which the most suitable candidate solution in the neighborhood is selected. Since reduction is one of the main procedures, it is necessary to optimize it on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of a warp-level implementation, which better matches the characteristics of current GPU architectures. Firstly, vectorized memory access is used to improve global memory performance. Secondly, at the thread-block level, an enhanced warp-based reduction in shared memory is presented to form partial results. Thirdly, the number of thread blocks is determined by maximizing the thread-block size against the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs and outperforms previous methods.
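The paper's kernels are not reproduced in this record, but the three steps named in the abstract (vectorized loads, warp-level reduction through shared memory, occupancy-driven grid sizing) map onto a well-known CUDA pattern. The following is a minimal sketch of that pattern for a float sum, assuming compute capability 3.0+ and CUDA 9+ for the __shfl_down_sync intrinsic; the names warpReduceSum, blockReduceSum, and reduceKernel are illustrative choices, not identifiers from the paper.

#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum: after log2(32) shuffle steps, lane 0 holds the warp total.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Block-level reduction built from warp reductions: each warp deposits one
// partial in shared memory, then the first warp reduces those partials.
__device__ float blockReduceSum(float val) {
    __shared__ float partials[32];            // one slot per warp (<= 32 warps/block)
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;

    val = warpReduceSum(val);                 // reduce within each warp
    if (lane == 0) partials[wid] = val;       // one partial per warp
    __syncthreads();

    int nWarps = (blockDim.x + warpSize - 1) / warpSize;
    val = (threadIdx.x < nWarps) ? partials[lane] : 0.0f;
    if (wid == 0) val = warpReduceSum(val);   // first warp reduces the partials
    return val;                               // valid in thread 0 of the block
}

// Grid-stride kernel over float4 elements: each load is one 128-bit
// transaction, the "vectorized access" step from the abstract. For brevity,
// the element count is assumed to be a multiple of 4 (no scalar tail).
__global__ void reduceKernel(const float4* in, float* out, int n4) {
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += blockDim.x * gridDim.x) {
        float4 v = in[i];
        sum += v.x + v.y + v.z + v.w;
    }
    sum = blockReduceSum(sum);
    if (threadIdx.x == 0) atomicAdd(out, sum);   // combine per-block partials
}

int main() {
    const int n = 1 << 22;                    // element count, multiple of 4
    float* h_in = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;   // known answer: sum == n

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    // Grid sizing in the spirit of the abstract: take the largest block the
    // device allows, then fill every SM up to its per-SM thread limit.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blockSize = prop.maxThreadsPerBlock;
    int gridSize  = prop.multiProcessorCount *
                    (prop.maxThreadsPerMultiProcessor / blockSize);

    reduceKernel<<<gridSize, blockSize>>>(
        reinterpret_cast<const float4*>(d_in), d_out, n / 4);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);

    delete[] h_in;
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The design point worth noting: because each warp reduces itself through registers via shuffles, shared memory holds only one partial per warp and the block needs a single __syncthreads(), whereas classic block-based tree reductions synchronize at every halving step.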
Pages: 43-52
Page count: 10