Fast Kronecker Matrix-Matrix Multiplication on GPUs

Authors
Jangda, Abhinav [1]
Yadav, Mohit [2]
Affiliations
[1] Microsoft Research, Redmond, WA 98052 USA
[2] University of Massachusetts, Amherst, MA 01003 USA
Keywords
Graphics Processing Units; CUDA; Kronecker Product; Linear Algebra; Tensor Contraction; CPU
DOI
10.1145/3627535.3638489
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker product of several smaller matrices. Kron-Matmul is a core operation in many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations build on existing tensor algebra operations, such as matrix multiplication, transpose, and tensor-matrix multiplication. However, this design choice prevents several Kron-Matmul-specific optimizations, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of linear algebra operations, which enables several new optimizations for Kron-Matmul. As a result, it runs up to 40.7x and 7.85x faster than existing implementations on 1 and 16 GPUs, respectively.
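To make the operation in the abstract concrete, the following is a minimal pure-Python sketch of Kron-Matmul, Y = X (F1 ⊗ F2 ⊗ ... ⊗ FN). It is not FastKron and does not reflect the paper's GPU kernels. It only contrasts the textbook definition (materialize the Kronecker product, then multiply) with the standard factor-by-factor evaluation (multiply by one factor, transpose, repeat), which is the kind of matmul-plus-transpose composition the abstract attributes to existing implementations. All function names and the sample matrices are illustrative assumptions.

```python
def kron(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def matmul(x, y):
    """Plain dense matrix product of x (m x k) and y (k x n)."""
    cols = list(zip(*y))
    return [[sum(u * v for u, v in zip(row, col)) for col in cols] for row in x]


def kron_matmul_naive(x, factors):
    """Definition: build F1 (x) F2 (x) ... (x) FN explicitly, then multiply."""
    full = factors[0]
    for f in factors[1:]:
        full = kron(full, f)
    return matmul(x, full)


def kron_matmul_factored(x, factors):
    """Same result without ever forming the full Kronecker product.

    For each row of x, contract the last index with the last factor,
    then transpose so the next index to contract becomes the last one.
    After all N factors the indices are back in their original order.
    """
    out = []
    for row in x:
        vec = list(row)
        for f in reversed(factors):
            p = len(f)  # number of rows of this factor
            blocks = [vec[i:i + p] for i in range(0, len(vec), p)]
            prod = matmul(blocks, f)  # (len(vec) / p) x q
            vec = [val for col in zip(*prod) for val in col]  # transpose, flatten
        out.append(vec)
    return out


# Illustrative inputs (hypothetical): X is 2 x 4, each factor is 2 x 2.
x = [[1, 2, 3, 4], [5, 6, 7, 8]]
f1 = [[1, 2], [3, 4]]
f2 = [[0, 1], [1, 0]]

y_naive = kron_matmul_naive(x, [f1, f2])
y_factored = kron_matmul_factored(x, [f1, f2])
print(y_naive)
print(y_factored)
```

Both functions return the same matrix. The factored form avoids storing the full Kronecker product, whose size grows with the product of the factor dimensions, but it still pays for a transpose after every factor.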
Pages: 390-403
Page count: 14