Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression

Cited by: 17
Authors
Boukaram, Wajih [1, 3]
Turkiyyah, George [2]
Keyes, David [1, 3]
Affiliations
[1] KAUST, Extreme Comp Res Ctr, Thuwal, Saudi Arabia
[2] Amer Univ Beirut, Dept Comp Sci, Beirut, Lebanon
[3] King Abdullah Univ Sci & Technol, Appl Math & Computat Sci, Thuwal, Saudi Arabia
Keywords
Hierarchical matrices; matrix compression; matvec; manycore algorithms; GPU; CUDA
DOI
10.1145/3232850
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Hierarchical matrices are space- and time-efficient representations of dense matrices that exploit the low-rank structure of matrix blocks at different levels of granularity. The hierarchically low-rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity of dense matrices. In this article, we present high-performance implementations of matrix-vector multiplication and compression operations for the H^2 variant of hierarchical matrices on GPUs. The H^2 variant exploits, in addition to the hierarchical block partitioning, hierarchical bases for the block representations, and results in a scheme that requires only O(n) storage and O(n) complexity for the mat-vec and compression kernels. These two operations are at the core of algebraic operations for hierarchical matrices: the mat-vec is a ubiquitous operation in numerical algorithms, while compression/recompression is a key building block for other algebraic operations, which require periodic recompression during execution. The difficulties in developing efficient GPU algorithms come primarily from the irregular tree data structures that underlie the hierarchical representations. The key to performance is to recast the computations on flattened trees in ways that allow batched linear algebra operations to be performed. This requires marshaling the irregularly laid out data so that the batched routines can use them. Marshaling operations involve only pointer arithmetic with no data movement and therefore have minimal overhead. Our numerical results on covariance matrices from 2D and 3D problems in spatial statistics show the high efficiency of our routines, which achieve over 550 GB/s for the bandwidth-limited matrix-vector operation and over 850 GFLOP/s in sustained performance for the compression operation on the P100 Pascal GPU.
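The marshaling idea in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' implementation: it assumes a complete binary tree stored level by level in one flat buffer, with every node holding a dense block of the same size, and the names `level_offsets`, `marshal_level`, and `batched_gemv` are hypothetical. Zero-copy `memoryview` slices stand in for the device pointers that a GPU code would pass to a batched BLAS routine.

```python
from array import array


def level_offsets(level, rows, cols):
    """Offsets of the node blocks on one level of a flattened tree.

    Assumes level-order numbering of a complete binary tree, so the
    first node on `level` has index 2**level - 1, and every node owns
    one row-major block of rows*cols entries (an assumption made here
    for simplicity; real hierarchical matrices have varying ranks).
    """
    first_node = 2 ** level - 1
    block_size = rows * cols
    return [(first_node + i) * block_size for i in range(2 ** level)]


def marshal_level(flat, level, rows, cols):
    """Build the per-level batch as views into `flat`; no data is copied."""
    view = memoryview(flat)
    block_size = rows * cols
    return [view[o:o + block_size] for o in level_offsets(level, rows, cols)]


def batched_gemv(blocks, xs, rows, cols):
    """CPU stand-in for a batched GEMV: y_i = A_i x_i for every block."""
    return [
        [sum(A[r * cols + c] * x[c] for c in range(cols)) for r in range(rows)]
        for A, x in zip(blocks, xs)
    ]


# A depth-2 tree has 7 nodes; each node holds a 2x2 block.
rows, cols = 2, 2
flat = array("d", range(7 * rows * cols))

blocks = marshal_level(flat, level=1, rows=rows, cols=cols)
xs = [[1.0, 1.0], [1.0, 1.0]]
ys = batched_gemv(blocks, xs, rows, cols)
print(ys)  # [[9.0, 13.0], [17.0, 21.0]]
```

Because the batch entries are views rather than copies, a later write to `flat` is visible through `blocks`, which mirrors the "pointer arithmetic with no data movement" property the abstract describes.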
Pages: 28