Matrix-Matrix Multiplication Using Multiple GPUs Connected by Nvlink

Authors
Choi, Yea Rem [1]
Nikolskiy, Vsevolod [1]
Stegailov, Vladimir [2]
Affiliations
[1] Natl Res Univ Higher Sch Econ, Moscow, Russia
[2] Russian Acad Sci, Joint Inst High Temp, Dolgoprudnyi, Russia
Keywords
parallel computing; CUDA; GEMM; high-speed GPU interconnect
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
In this work, we present an original GPU-only parallel matrix-matrix multiplication algorithm (C = alpha * A * B + beta * C) for servers with multiple GPUs connected by NVLink. The algorithm is implemented in CUDA. We examine its data transfer patterns, the overlap of communication and computation, and its overall performance. By adjusting the order of command calls and the tile sizes, we tune the algorithm so that asynchronous data transfers and kernel execution proceed without interruption. Two cases are considered: all the data are stored on one GPU, or the matrices are distributed among several GPUs. The execution efficiency of the new algorithm is compared with that of cuBLAS-XT from the NVIDIA CUDA Toolkit.
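The operation named in the abstract, C = alpha * A * B + beta * C computed tile by tile with the work spread over several devices, can be illustrated with a minimal host-side sketch. This is not the authors' algorithm: the abstract does not give the partitioning scheme, so the round-robin assignment of output tiles, the names `assign_tiles` and `gemm_tiled`, and the `tile` and `num_devices` parameters are all assumptions made for illustration. Real code would launch CUDA kernels and asynchronous copies where this sketch runs plain Python loops.

```python
def gemm_reference(alpha, A, B, beta, C):
    """Direct evaluation of alpha * A * B + beta * C on lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(m)]
            for i in range(n)]


def assign_tiles(n, m, tile, num_devices):
    """Hypothetical scheme: deal output tiles to devices round-robin."""
    origins = [(i0, j0) for i0 in range(0, n, tile) for j0 in range(0, m, tile)]
    owners = {d: [] for d in range(num_devices)}
    for t, origin in enumerate(origins):
        owners[t % num_devices].append(origin)
    return owners


def gemm_tiled(alpha, A, B, beta, C, tile=2, num_devices=2):
    """Tiled GEMM; each simulated device updates only the tiles it owns."""
    n, k, m = len(A), len(B), len(B[0])
    # Start from beta * C, then accumulate alpha * A * B one tile at a time.
    out = [[beta * C[i][j] for j in range(m)] for i in range(n)]
    for device, origins in assign_tiles(n, m, tile, num_devices).items():
        for i0, j0 in origins:
            # Walk the shared dimension in chunks, as a tiled kernel would.
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        out[i][j] += alpha * acc
    return out
```

Because each output tile has exactly one owner, the simulated devices never write to the same element, which is the property that lets per-tile transfers and computation be scheduled independently. How the tile size and call order affect overlap on real hardware is the subject of the paper and is not modeled here.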
Pages: 354-361
Page count: 8