A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver

Authors
Ding, Nan [1]
Liu, Yang [2]
Williams, Samuel [1]
Li, Xiaoye S. [2]
Affiliations
[1] Lawrence Berkeley Natl Lab, Computat Res Div, Berkeley, CA 94720 USA
[2] Lawrence Berkeley Natl Lab, Scalable Solvers Grp, Berkeley, CA 94720 USA
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Sparse triangular solve (SpTRSV) is used in conjunction with sparse LU factorization to solve sparse linear systems, either as a direct solver or as a preconditioner. As GPUs have become first-class compute resources, designing an efficient and scalable SpTRSV for multi-GPU HPC systems is imperative. In this paper, we leverage the GPU-initiated data transfers of NVSHMEM to implement and evaluate a multi-GPU SpTRSV. We create a novel producer-consumer paradigm to manage the computation and communication in SpTRSV and implement it using two CUDA streams. Our multi-GPU SpTRSV implementation using CUDA streams achieves a 3.7x speedup on twelve GPUs (two nodes) relative to our implementation on a single GPU, and up to a 6.1x speedup compared to cuSPARSE csrsv2() over the range of one to eighteen GPUs. To further explain the observed performance, and to identify the key matrix features that determine the potential benefit of using multiple GPUs, we extend the critical path model of SpTRSV to GPUs. We demonstrate that our performance model helps explain various aspects of performance and performance bottlenecks on multi-GPU systems and motivates code optimizations.
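For readers unfamiliar with the operation, the following is a minimal sequential sketch of a lower-triangular SpTRSV in CSR format, together with a dependency-level count that hints at why the critical path matters for parallel speedup. It is an illustration only and not the paper's multi-GPU, NVSHMEM-based implementation; the function names `sptrsv_lower_csr` and `dependency_levels` and the small example matrix are assumptions made for this sketch.

```python
# Illustrative sketch only: sequential forward substitution on a CSR
# lower-triangular matrix, plus a per-row dependency depth.
# Function names and the example matrix are hypothetical.

def sptrsv_lower_csr(indptr, indices, data, b):
    """Solve L x = b where L is lower triangular, stored in CSR."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]          # diagonal entry of row i
            elif j < i:
                acc -= data[k] * x[j]   # subtract already-solved unknowns
        if diag is None or diag == 0:
            raise ValueError(f"row {i} has a missing or zero diagonal")
        x[i] = acc / diag
    return x


def dependency_levels(indptr, indices):
    """level[i] = length of the longest dependency chain ending at row i."""
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:                   # row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    return level


# L = [[2,0,0,0],[1,1,0,0],[0,0,4,0],[0,3,1,2]], b = [2,3,8,10]
indptr = [0, 1, 3, 4, 7]
indices = [0, 0, 1, 2, 1, 2, 3]
data = [2.0, 1.0, 1.0, 4.0, 3.0, 1.0, 2.0]
b = [2.0, 3.0, 8.0, 10.0]

x = sptrsv_lower_csr(indptr, indices, data, b)
levels = dependency_levels(indptr, indices)
print(x)                 # solution vector
print(max(levels) + 1)   # number of sequential steps on the longest chain
```

Rows with the same level have no dependencies on one another and could in principle be solved concurrently, so the number of levels bounds how much parallelism any SpTRSV, on one GPU or many, can extract from a given matrix.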
Pages: 147-159
Page count: 13