High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning

Times cited: 34

Authors:

Cevahir, Ali [1]

Nukada, Akira [3]

Matsuoka, Satoshi [2,4]

Affiliations:

[1] Tokyo Inst Technol, Dept Math & Comp Sci, Meguro Ku, Ookayama 2-12-1, Tokyo 1528552, Japan

[2] Tokyo Inst Technol, Natl Inst Informat, JST CREST, Chiyoda Ku, Tokyo 1018430, Japan

[3] Tokyo Inst Technol, Meguro Ku, Tokyo 1528552, Japan

[4] Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan

Source:

COMPUTER SCIENCE-RESEARCH AND DEVELOPMENT | 2010 / Vol. 25 / Issue 1-2

Funding:

Japan Science and Technology Agency (JST);

Keywords:

GPU computing; GPU cluster; Conjugate Gradients; Hypergraph partitioning;

DOI:

10.1007/s00450-010-0112-6

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline code:

0812;

Abstract:

Motivated by the high computation power and low price-to-performance ratio of GPUs, GPU-accelerated clusters are being built for high-performance scientific computing. In this work, we propose a scalable implementation of a Conjugate Gradient (CG) solver for unstructured matrices on a GPU-extended cluster, where each cluster node has multiple GPUs. The solver's basic computations are performed on the GPUs, and communication is managed by the CPU. For sparse matrix-vector multiplication, the most time-consuming operation, the solver selects the fastest of several high-performance GPU kernels. Scalability is harder to obtain on a GPU-extended cluster than on a traditional CPU cluster: because computation on GPUs is much faster than on CPUs, GPU-extended clusters demand faster communication between compute units. To achieve scalability, we adopt hypergraph-partitioning models, which are state-of-the-art models for communication reduction and load balancing in parallel sparse iterative solvers. We implement a hierarchical partitioning model that optimizes better for the underlying heterogeneous system. In our experiments, we obtain up to 94 Gflops of double-precision CG performance using 64 NVIDIA Tesla GPUs on 32 nodes.
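The abstract combines three ingredients: a CG iteration, a sparse matrix-vector product (SpMV) that dominates its cost, and hypergraph partitioning to cut the communication that SpMV causes once rows are distributed across devices. The following is a minimal serial sketch in pure Python, not the authors' GPU implementation. It shows unpreconditioned CG over a matrix in CSR form and the "connectivity minus one" communication-volume count that a standard column-net hypergraph model minimizes for a row-wise partition. The function names (`spmv_csr`, `cg`, `comm_volume`), the CSR layout, the square matrix, and the assumption that the part owning row i also owns x[i] are all illustrative assumptions. The abstract says only "hypergraph-partitioning models", so the column-net model is assumed here.

```python
# Minimal serial sketch (pure Python, no GPU): CG on a CSR matrix, plus the
# column-net hypergraph communication-volume count for a row-wise partition.
# All names and the CSR layout are illustrative assumptions, not the paper's code.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Return y = A @ x for a square A stored in CSR form (row_ptr, col_idx, vals)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y[i] = s
    return y


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def cg(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive-definite A, starting at x = 0."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x, with x = 0
    p = list(r)  # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:
            break
        ap = spmv_csr(row_ptr, col_idx, vals, p)  # SpMV: the dominant per-iteration cost
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def comm_volume(row_ptr, col_idx, row_part):
    """Words exchanged per SpMV under a row-wise partition (column-net model).

    Part row_part[i] owns row i and, by assumption, x[i]. The net of column j
    holds the owner of x[j] plus the part of every row with a nonzero in
    column j; x[j] is sent once to each non-owner part in the net, so the
    column contributes (lambda_j - 1) words, where lambda_j = |net j|.
    """
    n = len(row_ptr) - 1
    nets = [{row_part[j]} for j in range(n)]  # seed each net with x[j]'s owner
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            nets[col_idx[k]].add(row_part[i])
    return sum(len(parts) - 1 for parts in nets)


if __name__ == "__main__":
    # 4x4 1D Laplacian (tridiagonal 2, -1) in CSR form; b = A @ [1, 1, 1, 1].
    row_ptr = [0, 2, 5, 8, 10]
    col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
    vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
    b = [1.0, 0.0, 0.0, 1.0]
    print(cg(row_ptr, col_idx, vals, b))                # approximately [1, 1, 1, 1]
    print(comm_volume(row_ptr, col_idx, [0, 0, 1, 1]))  # contiguous split: 2
    print(comm_volume(row_ptr, col_idx, [0, 1, 0, 1]))  # interleaved split: 4
```

On this small example, the contiguous two-way split costs 2 words per SpMV and the interleaved split costs 4. A hypergraph partitioner is meant to find splits of the first kind while keeping the nonzeros balanced across parts. The paper's hierarchical model and its per-matrix choice among GPU SpMV kernels are not reproduced in this sketch.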

Pages: 83-91

Page count: 9

Related papers (50 in total):

[21] Algorithmic skeletons for multi-core, multi-GPU systems and clusters
Ernsting, Steffen
Kuchen, Herbert
International Journal of High Performance Computing and Networking, 2012, 7 (02) : 129 - 138
[22] High Performance Single and Multi-GPU Acceleration for Diffuse Optical Tomography
Saikia, Manob Jyoti
Kanhirodan, Rajan
2014 INTERNATIONAL CONFERENCE ON CONTEMPORARY COMPUTING AND INFORMATICS (IC3I), 2014 : 1320 - 1323
[23] Efficient SDS Simulations on Multi-GPU Nodes of XSEDE High-end Clusters
Schlachter, Samuel
Herbein, Stephen
Taufer, Michela
Ou, Shuching
Patel, Sandeep
Logan, Jeremy S.
2013 IEEE 9TH INTERNATIONAL CONFERENCE ON E-SCIENCE (E-SCIENCE), 2013 : 116 - 123
[24] Accelerating neural network architecture search using multi-GPU high-performance computing
Lupion, Marcos
Cruz, N. C.
Sanjuan, Juan F.
Paechter, B.
Ortigosa, Pilar M.
JOURNAL OF SUPERCOMPUTING, 2023, 79 (07) : 7609 - 7625
[26] A Comparative Study of Preconditioners for GPU-Accelerated Conjugate Gradient Solver
Chen, Yao
Zhao, Yonghua
Zhao, Wei
Zhao, Lian
2013 IEEE 15TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2013 IEEE INTERNATIONAL CONFERENCE ON EMBEDDED AND UBIQUITOUS COMPUTING (HPCC_EUC), 2013 : 628 - 635
[27] Distributed Join Algorithms on Multi-GPU Clusters with GPUDirect RDMA
Guo, Chengxin
Chen, Hong
Zhang, Feng
Li, Cuiping
PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019
[28] Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications
Zhong, Ziming
Rychkov, Vladimir
Lastovetsky, Alexey
2012 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2012 : 191 - 199
[30] Efficient implementation of data flow graphs on multi-gpu clusters
Boulos, Vincent
Huet, Sylvain
Fristot, Vincent
Salvo, Luc
Houzet, Dominique
JOURNAL OF REAL-TIME IMAGE PROCESSING, 2014, 9 (01) : 217 - 232
