Fast Kronecker Matrix-Matrix Multiplication on GPUs

Authors
Jangda, Abhinav [1]
Yadav, Mohit [2]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Univ Massachusetts, Amherst, MA 01003 USA
Keywords
Graphics Processing Units; CUDA; Kronecker Product; Linear Algebra; Tensor Contraction; CPU
DOI
10.1145/3627535.3638489
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker product of several smaller matrices. Kron-Matmul is a core operation in many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations rely on existing tensor algebra operations, such as matrix multiplication, transpose, and tensor-matrix multiplication. However, this design choice prevents several Kron-Matmul-specific optimizations, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of linear algebra operations, which enables several new optimizations for Kron-Matmul. As a result, it performs up to 40.7x and 7.85x faster than existing implementations on 1 and 16 GPUs, respectively.
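To make the operation in the abstract concrete, the following is a minimal CPU sketch in pure Python, not FastKron's GPU algorithm. It computes X (F1 ⊗ F2 ⊗ ... ⊗ Fn) in two ways: by materializing the full Kronecker product, and by a well-known factor-at-a-time scheme (multiply by one factor, then transpose) that is in the spirit of the matmul-plus-transpose approach the abstract attributes to existing implementations. All function names (`kron`, `matmul`, `kron_matmul_naive`, `kron_matmul_factorwise`) are hypothetical and chosen for illustration only.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def matmul(x, y):
    """Plain dense matrix product x @ y."""
    return [[sum(x_row[k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for x_row in x]


def kron_matmul_naive(x, factors):
    """Materialize F1 (x) F2 (x) ... (x) Fn, then multiply once."""
    full = factors[0]
    for f in factors[1:]:
        full = kron(full, f)
    return matmul(x, full)


def kron_matmul_factorwise(x, factors):
    """Same result without ever forming the Kronecker product.

    Factors are applied from last to first. For each factor of shape
    (p, q), every row is viewed as k slices of length p, each slice is
    multiplied by the factor, and the (k, q) result is stored transposed
    as (q, k) so that the next factor again acts on the fastest index.
    """
    rows = [list(r) for r in x]
    for f in reversed(factors):
        p, q = len(f), len(f[0])
        new_rows = []
        for row in rows:
            k = len(row) // p  # number of length-p slices in this row
            out = [0] * (q * k)
            for s in range(k):
                chunk = row[s * p:(s + 1) * p]
                for j in range(q):
                    # Write at transposed position (j, s).
                    out[j * k + s] = sum(chunk[i] * f[i][j] for i in range(p))
            new_rows.append(out)
        rows = new_rows
    return rows


# Small example: X is 2 x 4, F1 is 2 x 2, F2 is 2 x 3, so the result is 2 x 6.
x = [[1, 2, 3, 4], [5, 6, 7, 8]]
f1 = [[1, 2], [3, 4]]
f2 = [[1, 0, 2], [0, 1, 3]]
assert kron_matmul_naive(x, [f1, f2]) == kron_matmul_factorwise(x, [f1, f2])
```

The factor-at-a-time version touches only the small factors, so its working set stays proportional to the size of X, whereas the naive version builds a matrix whose dimensions are the products of all factor dimensions.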
Pages: 390-403
Page count: 14