Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

Cited by: 16
Authors
Kim, Raehyun [1 ]
Choi, Jaeyoung [1 ]
Lee, Myungho [2 ]
Affiliations
[1] Soongsil Univ, Seoul, South Korea
[2] Myongji Univ, Yongin, Gyeonggi, South Korea
Keywords
Manycore; Intel Xeon; Intel Xeon Phi; Auto-tuning; Matrix-matrix multiplication; AVX-512
DOI
10.1145/3293320.3293334
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Classification Code
081202
Abstract
This paper presents optimal implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi Processor code-named Knights Landing (KNL) and the Intel Xeon Scalable Processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. Our auto-tuning approach precisely determines parameter values that reflect the target architecture's features. It significantly reduces the search space and derives optimal parameter sets, including submatrix sizes, prefetch distances, loop-unrolling depth, and the parallelization scheme. Without a single line of assembly code, our GEMM kernels achieve performance comparable to Intel MKL and outperform other open-source BLAS libraries.
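
The paper itself includes no code listing, but the idea of an intrinsics-only GEMM micro-kernel with tunable tile shape and prefetch distance can be sketched as follows. This is a minimal illustration, not the authors' kernel: the 16x4 tile (MR, NR), the prefetch distance PF_DIST, and the packed-panel layout are hypothetical stand-ins for the values the auto-tuner would select.

#include <immintrin.h>

/* Illustrative single-precision micro-kernel in the spirit of the paper:
   an MR x NR block of C is accumulated with AVX-512 FMA intrinsics only,
   no assembly. MR = 16 (one zmm register per C column), NR = 4, and
   PF_DIST are hypothetical values standing in for parameters the
   auto-tuner would choose. A is packed as K consecutive MR-element
   columns, B as K consecutive NR-element rows, and C is column-major
   with leading dimension ldc >= MR. */
#define MR 16
#define NR 4
#define PF_DIST 256   /* hypothetical prefetch distance in bytes */

static void sgemm_ukernel_16x4(int K, const float *A, const float *B,
                               float *C, int ldc)
{
    __m512 c0 = _mm512_loadu_ps(C + 0 * ldc);
    __m512 c1 = _mm512_loadu_ps(C + 1 * ldc);
    __m512 c2 = _mm512_loadu_ps(C + 2 * ldc);
    __m512 c3 = _mm512_loadu_ps(C + 3 * ldc);

    for (int k = 0; k < K; ++k) {
        /* Software prefetch into the packed A panel; the distance is one
           of the parameters the auto-tuner searches over. */
        _mm_prefetch((const char *)(A + MR * k) + PF_DIST, _MM_HINT_T0);

        __m512 a = _mm512_loadu_ps(A + MR * k);      /* one column of A */
        c0 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 0]), c0);
        c1 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 1]), c1);
        c2 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 2]), c2);
        c3 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 3]), c3);
    }

    _mm512_storeu_ps(C + 0 * ldc, c0);
    _mm512_storeu_ps(C + 1 * ldc, c1);
    _mm512_storeu_ps(C + 2 * ldc, c2);
    _mm512_storeu_ps(C + 3 * ldc, c3);
}

An auto-tuner along the lines of the abstract would vary MR, NR, the k-loop unroll depth, PF_DIST, and the thread-level blocking, time each candidate kernel, and keep the fastest; restricting these parameters to values consistent with the register file and cache sizes is what shrinks the search space.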
Pages: 101-110
Number of pages: 10