Sampled Dense Matrix Multiplication for High-Performance Machine Learning

Cited by: 16

Authors:

Nisa, Israt [1]

Sukumaran-Rajam, Aravind [1]

Kurt, Sureyya Emre [1]

Hong, Changwan [1]

Sadayappan, P. [1]

Affiliations:

[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

Source:

2018 IEEE 25th International Conference on High Performance Computing (HiPC) | 2018

Funding:

U.S. National Science Foundation

Keywords:

SDDMM; GPU; Optimization; Sparse matrix; Factorization

DOI:

10.1109/HiPC.2018.00013

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Many machine learning methods involve iterative optimization and are amenable to a variety of alternate formulations. Many currently popular formulations of some machine learning methods are based on core operations that essentially correspond to sparse matrix-vector products. A reformulation using sparse matrix-matrix product primitives can potentially enable significant performance enhancement. Sampled Dense-Dense Matrix Multiplication (SDDMM) is a primitive that has been shown to be usable as a core component in reformulations of many machine learning factor analysis algorithms such as Alternating Least Squares (ALS), Latent Dirichlet Allocation (LDA), Sparse Factor Analysis (SFA), and Gamma Poisson (GaP). It requires computing the product of two dense input matrices, but only at the locations of the result matrix that correspond to nonzero entries in a third, sparse input matrix. In this paper, we address the development of cuSDDMM, a multi-node GPU-accelerated implementation of SDDMM. We analyze the data reuse characteristics of SDDMM and develop a model-driven strategy for choosing the tiling permutation and tile size. cuSDDMM improves significantly (up to 4.6x) over the best currently available GPU implementation of SDDMM (in the BIDMach machine learning library).
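
To make the SDDMM operation described in the abstract concrete, the following is a minimal sequential reference sketch in pure Python. It is not the cuSDDMM implementation and involves no tiling or GPU code. The function name `sddmm`, the coordinate (COO) layout of the sparse matrix, and the optional `scale` flag (multiplying each sampled dot product by the corresponding nonzero value) are assumptions made for illustration only.

```python
def sddmm(rows, cols, vals, A, B, scale=True):
    """Reference SDDMM sketch (illustrative, not the paper's code).

    A is a dense M x K matrix and B a dense K x N matrix (lists of lists).
    The sparse matrix S is given in COO form: nonzero t sits at
    (rows[t], cols[t]) with value vals[t].

    Returns one output value per nonzero of S, in the same order.
    """
    K = len(B)  # shared inner dimension of A and B
    out = []
    for i, j, s in zip(rows, cols, vals):
        # Dot product of row i of A with column j of B, computed
        # only because S has a nonzero at position (i, j).
        dot = sum(A[i][k] * B[k][j] for k in range(K))
        # Assumption: optionally scale by the sparse value itself.
        out.append(s * dot if scale else dot)
    return out


# Small hypothetical example: A is 3 x 2, B is 2 x 3, S has 3 nonzeros.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
rows = [0, 1, 2]
cols = [0, 2, 1]
vals = [2.0, 1.0, 0.5]

result = sddmm(rows, cols, vals, A, B)
print(result)
```

The work is proportional to the number of nonzeros in S times K, rather than M x N x K for the full dense product, which is why the sparsity pattern drives both the cost and the data reuse behavior the abstract refers to.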


Pages: 32-41

Page count: 10
