A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver

Authors:

Ding, Nan [1]

Liu, Yang [2]

Williams, Samuel [1]

Li, Xiaoye S. [2]

Affiliations:

[1] Lawrence Berkeley Natl Lab, Computat Res Div, Berkeley, CA 94720 USA

[2] Lawrence Berkeley Natl Lab, Scalable Solvers Grp, Berkeley, CA 94720 USA

Source:

PROCEEDINGS OF THE 2021 SIAM CONFERENCE ON APPLIED AND COMPUTATIONAL DISCRETE ALGORITHMS, ACDA21 | 2021

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

Sparse triangular solve (SpTRSV) is used in conjunction with sparse LU factorization to solve sparse linear systems, either as a direct solver or as a preconditioner. As GPUs have become first-class compute citizens, designing an efficient and scalable SpTRSV for multi-GPU HPC systems is imperative. In this paper, we leverage the GPU-initiated data transfers of NVSHMEM to implement and evaluate a multi-GPU SpTRSV. We create a novel producer-consumer paradigm to manage the computation and communication in SpTRSV and implement it using two CUDA streams. Our multi-GPU SpTRSV implementation using CUDA streams achieves a 3.7x speedup when using twelve GPUs (two nodes) relative to our implementation on a single GPU, and up to 6.1x compared to cuSPARSE csrsv2() over the range of one to eighteen GPUs. To further explain the observed performance and to explore the key matrix features for estimating the potential performance benefits of using multiple GPUs, we extend the critical path model of SpTRSV to GPUs. We demonstrate that our performance model helps explain various aspects of performance and performance bottlenecks on multi-GPU systems and motivates code optimizations.
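For readers unfamiliar with the kernel named in the abstract, the following is a minimal sequential sketch of a sparse lower-triangular solve and of the row dependencies that a critical path analysis looks at. It is not the authors' multi-GPU implementation: the CSR layout, the function names `sptrsv_lower_csr` and `dependency_levels`, and the 3x3 example matrix are all assumptions chosen for illustration.

```python
def sptrsv_lower_csr(indptr, indices, data, b):
    """Solve L x = b by forward substitution, with L lower triangular in CSR form.

    Hypothetical helper for illustration only; assumes every row stores
    a nonzero diagonal entry.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                # Row i needs x[j] from an earlier row: this is the dependency
                # that limits parallelism in SpTRSV.
                acc -= data[k] * x[j]
            elif j == i:
                diag = data[k]
        if diag is None or diag == 0.0:
            raise ValueError(f"row {i} has no nonzero diagonal entry")
        x[i] = acc / diag
    return x


def dependency_levels(indptr, indices):
    """Return, for each row, the length of its longest chain of prerequisite rows.

    Rows with the same level do not depend on each other; the number of
    distinct levels is the length of the critical path through the solve.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level


# Example (assumed data): L = [[2, 0, 0], [1, 1, 0], [0, 3, 4]], b = [2, 3, 10]
indptr = [0, 1, 3, 5]
indices = [0, 0, 1, 1, 2]
data = [2.0, 1.0, 1.0, 3.0, 4.0]
b = [2.0, 3.0, 10.0]

x = sptrsv_lower_csr(indptr, indices, data, b)
levels = dependency_levels(indptr, indices)
critical_path_length = max(levels) + 1
```

In this small example every row depends on the previous one, so the critical path spans all three rows and no two rows could be solved concurrently. Matrices with many rows per level leave more room for parallel execution, which is the kind of structural property a critical path model reasons about.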

Pages: 147-159

Page count: 13
