Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

Cited by: 16
Authors
Kim, Raehyun [1 ]
Choi, Jaeyoung [1 ]
Lee, Myungho [2 ]
Affiliations
[1] Soongsil Univ, Seoul, South Korea
[2] Myongji Univ, Yongin, Gyeonggi, South Korea
Keywords
Manycore; Intel Xeon; Intel Xeon Phi; Auto-tuning; Matrix-matrix multiplication; AVX-512
DOI
10.1145/3293320.3293334
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Classification Code
081202
Abstract
This paper presents optimal implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi Processor code-named Knights Landing (KNL) and the Intel Xeon Scalable Processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. Our auto-tuning approach precisely determines parameter values that reflect the target architecture's features. It significantly reduces the search space and derives optimal parameter sets, including submatrix sizes, prefetch distances, loop-unrolling depth, and the parallelization scheme. Without a single line of assembly code, our GEMM kernels achieve performance comparable to Intel MKL and outperform other open-source BLAS libraries.
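
The paper itself includes no code listing, but the idea of an intrinsics-only GEMM micro-kernel with tunable tile shape and prefetch distance can be sketched as follows. This is a minimal illustration, not the authors' kernel: the 16x4 tile (MR, NR), the prefetch distance PF_DIST, and the packed-panel layout are hypothetical stand-ins for the values the auto-tuner would select.

#include <immintrin.h>

/* Illustrative single-precision micro-kernel in the spirit of the paper:
   an MR x NR block of C is accumulated with AVX-512 FMA intrinsics only,
   no assembly. MR = 16 (one zmm register per C column), NR = 4, and
   PF_DIST are hypothetical values standing in for parameters the
   auto-tuner would choose. A is packed as K consecutive MR-element
   columns, B as K consecutive NR-element rows, and C is column-major
   with leading dimension ldc >= MR. */
#define MR 16
#define NR 4
#define PF_DIST 256   /* hypothetical prefetch distance in bytes */

static void sgemm_ukernel_16x4(int K, const float *A, const float *B,
                               float *C, int ldc)
{
    __m512 c0 = _mm512_loadu_ps(C + 0 * ldc);
    __m512 c1 = _mm512_loadu_ps(C + 1 * ldc);
    __m512 c2 = _mm512_loadu_ps(C + 2 * ldc);
    __m512 c3 = _mm512_loadu_ps(C + 3 * ldc);

    for (int k = 0; k < K; ++k) {
        /* Software prefetch into the packed A panel; the distance is one
           of the parameters the auto-tuner searches over. */
        _mm_prefetch((const char *)(A + MR * k) + PF_DIST, _MM_HINT_T0);

        __m512 a = _mm512_loadu_ps(A + MR * k);      /* one column of A */
        c0 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 0]), c0);
        c1 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 1]), c1);
        c2 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 2]), c2);
        c3 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[NR * k + 3]), c3);
    }

    _mm512_storeu_ps(C + 0 * ldc, c0);
    _mm512_storeu_ps(C + 1 * ldc, c1);
    _mm512_storeu_ps(C + 2 * ldc, c2);
    _mm512_storeu_ps(C + 3 * ldc, c3);
}

An auto-tuner along the lines of the abstract would vary MR, NR, the k-loop unroll depth, PF_DIST, and the thread-level blocking, time each candidate kernel, and keep the fastest; restricting these parameters to values consistent with the register file and cache sizes is what shrinks the search space.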
Pages: 101-110
Number of pages: 10