Exploration of automatic optimisation for CUDA programming

Cited by: 2

Authors:

Al-Mouhamed, Mayez ^{[1]}

ul Hassan Khan, Ayaz ^{[1]}

Affiliations:

[1] King Fahd Univ Petr & Minerals, Dept Comp Engn, Dhahran, Saudi Arabia

Source:

INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS | 2015 / Vol. 30 / No. 04

Keywords:

CUDA; GPU; parallel programming; compiler transformations; directive-based language; source-to-source compiler; GPGPU;

DOI:

10.1080/17445760.2014.953158

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Writing optimised compute unified device architecture (CUDA) programs for graphics processing units (GPUs) is complex even for experts. We present a design methodology for a restructuring tool that converts C-loops into optimised CUDA kernels based on a three-step algorithm: loop tiling, coalesced memory access and resource optimisation. A method for finding possible loop tiling solutions with coalesced memory access is developed and a simplified algorithm for restructuring C-loops into an efficient CUDA kernel is presented. In the evaluation, we implement matrix multiply (MM), matrix transpose (M-transpose), matrix scaling (M-scaling) and matrix vector multiply (MV) using the proposed algorithm. We present an analysis of the execution time and GPU throughput for these applications, which compare favourably with other proposals. Evaluation is carried out while scaling the problem size and running under a variety of kernel configurations. The obtained speedup is about 28-35% for M-transpose compared to the NVIDIA Software Development Kit, 33% for MV compared to the general purpose computation on graphics processing unit compiler, and more than 80% for MM and M-scaling compared to CUDA-lite.
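To illustrate the loop tiling step named in the abstract, the following is a minimal CPU-side sketch in plain C, not the authors' restructuring tool or its output. It shows a C-loop matrix transpose and a tiled version of the same loop nest. The function names and the `tile` parameter are hypothetical, chosen only for this illustration. In a CUDA mapping, each tile would typically be assigned to one thread block, but that mapping is an assumption here and is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Naive transpose of a rows x cols row-major matrix: out[j][i] = in[i][j]. */
void transpose_naive(const float *in, float *out, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            out[j * rows + i] = in[i * cols + j];
}

/* Tiled transpose: the same iteration space, split into tile x tile blocks.
 * The outer loops walk over tiles; the inner loops walk inside one tile.
 * Edge tiles are clipped so sizes need not be multiples of `tile`. */
void transpose_tiled(const float *in, float *out,
                     size_t rows, size_t cols, size_t tile) {
    assert(tile > 0); /* a zero tile size would never advance the outer loops */
    for (size_t ii = 0; ii < rows; ii += tile) {
        for (size_t jj = 0; jj < cols; jj += tile) {
            size_t i_end = (ii + tile < rows) ? ii + tile : rows;
            size_t j_end = (jj + tile < cols) ? jj + tile : cols;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    out[j * rows + i] = in[i * cols + j];
        }
    }
}
```

Both functions write identical results; tiling only changes the order in which elements are visited, which is what lets a restructuring step later assign tiles to parallel workers.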


Pages: 309-324

Number of pages: 16

50 items in total

[1] Exploration of Automatic Optimization for CUDA Programming
Al-Mouhamed, Mayez
Khan, Ayaz ul Hassan
[J]. 2012 2ND IEEE INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING (PDGC), 2012, : 55 - 60
[2] A CUDA programming toolkit on grids
Liang, Tyng-Yeu
Chang, Yu-Wei
Li, Hung-Fu
[J]. INTERNATIONAL JOURNAL OF GRID AND UTILITY COMPUTING, 2012, 3 (2-3) : 97 - 111
[3] CUDA memory optimisation strategies for motion estimation
Sayadi, Fatma Elzahra
Chouchene, Marwa
Bahri, Haithem
Khemiri, Randa
Atri, Mohamed
[J]. IET COMPUTERS AND DIGITAL TECHNIQUES, 2019, 13 (01) : 20 - 27
[4] GSGP-CUDA-A CUDA framework for Geometric Semantic Genetic Programming
Trujillo, Leonardo
Munoz Contreras, Jose Manuel
Hernandez, Daniel E.
Castelli, Mauro
Tapia, Juan J.
[J]. SOFTWAREX, 2022, 18
[5] Structural testing for CUDA programming model
Luz, Helder J. F.
Souza, Paulo S. L.
Souza, Simone R. S.
[J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2024, 36 (14):
[6] CUDABlock: A GUI Programming Tool for CUDA
Lin, Hsih-Hsin
Tu, Chia-Heng
Hwang, Yuan-Shin
[J]. 2015 44TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS, 2015, : 37 - 42
[7] Web applications for learning CUDA programming
Wakatani, Akiyoshi
Maeda, Toshiyuki
[J]. 2017 8TH INTERNATIONAL CONFERENCE ON INFORMATION, INTELLIGENCE, SYSTEMS & APPLICATIONS (IISA), 2017, : 571 - 575
[8] CUDA-enabled Optimisation of Technical Analysis Parameters
O'Rourke, John
Burns, John
[J]. 2012 IEEE/ACM 16TH INTERNATIONAL SYMPOSIUM ON DISTRIBUTED SIMULATION AND REAL TIME APPLICATIONS (DS-RT), 2012, : 221 - 227
[9] GPUBlocks: GUI Programming Tool for CUDA and OpenCL
Yuan-Shin Hwang
Hsih-Hsin Lin
Shen-Hung Pai
Chia-Heng Tu
[J]. Journal of Signal Processing Systems, 2019, 91 : 235 - 245
[10] CUDAMicroBench: Microbenchmarks to Assist CUDA Performance Programming
Yi, Xinyao
Stokes, David
Yan, Yonghong
Liao, Chunhua
[J]. 2021 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2021, : 397 - 406
