Matrix-Matrix Multiplication Using Multiple GPUs Connected by Nvlink

Authors
Choi, Yea Rem [1]
Nikolskiy, Vsevolod [1]
Stegailov, Vladimir [2]
Affiliations
[1] Natl Res Univ Higher Sch Econ, Moscow, Russia
[2] Russian Acad Sci, Joint Inst High Temp, Dolgoprudnyi, Russia
Keywords
parallel computing; CUDA; GEMM; high-speed GPU interconnect
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this work we present an original GPU-only parallel matrix-matrix multiplication algorithm (C = alpha * A * B + beta * C) for servers with multiple GPUs connected by NVLink. The algorithm is implemented in CUDA. We consider the data transfer patterns, the overlap of communication and computation, and the overall performance of the algorithm. By adjusting the order of command calls and the tile sizes, we tune the algorithm for uninterrupted asynchronous data transmission and kernel execution. Two cases are considered: all the data are stored on one GPU, and the matrices are distributed among several GPUs. The execution efficiency of the new algorithm is compared with cuBLAS-XT from the NVIDIA CUDA Toolkit.
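The tiled form of the operation named in the abstract can be illustrated with a small host-side sketch. This is not the authors' CUDA implementation: it is a minimal pure-Python illustration, assuming a round-robin assignment of output tiles to devices, and it does not model NVLink transfers or the overlap of communication and computation. The function name `tiled_gemm` and the parameters `tile` and `num_devices` are hypothetical.

```python
def tiled_gemm(alpha, A, B, beta, C, tile=2, num_devices=2):
    """Return (alpha*A*B + beta*C, owner), computed tile by tile.

    Matrices are lists of rows. `owner` maps the top-left index of each
    output tile to the hypothetical device that would compute it
    (round-robin assignment; an assumption, not the paper's scheme).
    """
    m, k, n = len(A), len(B), len(B[0])
    # Start from beta*C, then accumulate alpha*A*B into it.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    owner = {}
    tile_id = 0
    for i0 in range(0, m, tile):            # tile rows of C
        for j0 in range(0, n, tile):        # tile columns of C
            owner[(i0, j0)] = tile_id % num_devices
            tile_id += 1
            for k0 in range(0, k, tile):    # panels along the inner dimension
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = 0.0
                        for p in range(k0, min(k0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        out[i][j] += alpha * acc
    return out, owner


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[1, 0], [0, 1], [1, 1]]
    C = [[1, 1], [1, 1]]
    result, owner = tiled_gemm(2, A, B, 3, C, tile=1, num_devices=2)
    print(result)
    print(owner)
```

Each output tile depends only on a row panel of A and a column panel of B, which is why tiles can be computed independently on different devices. In a real multi-GPU version the tile size would also determine how much data each transfer moves.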
Pages: 354-361
Number of pages: 8