FPGA architecture and implementation of sparse matrix-vector multiplication for the finite element method

Cited by: 17
Authors
Elkurdi, Yousef [1 ]
Fernandez, David [1 ]
Souleimanov, Evgueni [1 ]
Giannacopoulos, Dennis [1 ]
Gross, Warren J. [1 ]
Institutions
[1] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ H3A 2A7, Canada
Keywords
FPGA; SMVM; FEM;
DOI
10.1016/j.cpc.2007.11.014
CLC number
TP39 [Computer applications];
Discipline codes
081203 ; 0835 ;
Abstract
The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool with diverse applications ranging from structural engineering to electromagnetic simulation. Trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), and interest in exploiting this technology has therefore grown in the scientific community. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits the FEM matrix sparsity structure to balance performance against hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGA's computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme that enables it to process arbitrarily large matrices without changing the number of PEs; the architecture is therefore limited only by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz, obtaining a peak performance of 1.76 GFLOPS. For the 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved.
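To make the kernel concrete, the following is a minimal software analogue of the SMVM computation in compressed sparse row (CSR) form; the paper's hardware streams nonzeros past a fixed array of PEs, which this sequential sketch only loosely mirrors. All names here (`smvm_csr`, the example matrix) are illustrative assumptions, not from the paper.

```python
# Software sketch of sparse matrix-vector multiplication (SMVM) in
# compressed sparse row (CSR) form -- the core kernel the paper's
# pipelined FPGA architecture accelerates in a stream-through fashion.

def smvm_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for an n-row sparse matrix A stored in CSR."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        # Stream the nonzeros of row i through a multiply-accumulate.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example:  A = [[4, 0, 1],
#                    [0, 3, 0],
#                    [2, 0, 5]]
values  = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(smvm_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 7.0]
```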
Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory. (c) 2007 Elsevier B.V. All rights reserved.
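As a hedged illustration of where SMVM sits inside an iterative solver, here is a minimal conjugate gradient sketch: each iteration performs exactly one matrix-vector product (`matvec`), which is the call the paper's pipeline would serve without per-iteration data reloads. The function names and the tiny test system are illustrative, not the authors' implementation.

```python
# Minimal conjugate gradient (CG) sketch: one SMVM call per iteration
# dominates the cost, which is why a stream-through SMVM pipeline that
# avoids FPGA-internal data loading between iterations pays off.

def cg(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A, given y = matvec(x)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x0 with x0 = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)            # the SMVM kernel: one call per iteration
        alpha = rs / sum(pi * Api for pi, Api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Tiny SPD system: A = [[4, 1], [1, 3]], b = [1, 2]
A = [[4.0, 1.0], [1.0, 3.0]]
x = cg(lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A],
       [1.0, 2.0])
```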
Pages: 558-570
Page count: 13
Related papers
50 in total
  • [31] Implementation of a floating-point matrix-vector multiplication on a reconfigurable architecture
    Garzia, Fabio
    Brunelli, Claudio
    Rossi, Davide
    Nurmi, Jari
    2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, : 3496 - +
  • [32] An architecture-aware technique for optimizing sparse matrix-vector multiplication on GPUs
    Maggioni, Marco
    Berger-Wolf, Tanya
    2013 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, 2013, 18 : 329 - 338
  • [33] Fast Matrix-Vector Multiplication in the Sparse-Grid Galerkin Method
    Zeiser, Andreas
    JOURNAL OF SCIENTIFIC COMPUTING, 2011, 47 (03) : 328 - 346
  • [34] A Nested Dissection Partitioning Method for Parallel Sparse Matrix-Vector Multiplication
    Boman, Erik G.
    Wolf, Michael M.
    2013 IEEE CONFERENCE ON HIGH PERFORMANCE EXTREME COMPUTING (HPEC), 2013,
  • [35] FPGA Design and Implementation of Dense Matrix-Vector Multiplication for Image Processing Application
    Qasim, Syed M.
    Telba, Ahmed A.
    AlMazroo, Abdulhameed Y.
    WORLD CONGRESS ON ENGINEERING AND COMPUTER SCIENCE, VOLS 1 AND 2, 2010, : 594 - 597
  • [37] Optimizing the Performance of the Sparse Matrix-Vector Multiplication Kernel in FPGA Guided by the Roofline Model
    Favaro, Federico
    Dufrechou, Ernesto
    Oliver, Juan P.
    Ezzatti, Pablo
    MICROMACHINES, 2023, 14 (11)
  • [38] DENSE MATRIX-VECTOR MULTIPLICATION ON THE CUDA ARCHITECTURE
    Fujimoto, Noriyuki
    PARALLEL PROCESSING LETTERS, 2008, 18 (04) : 511 - 530
  • [39] Adaptive sparse matrix representation for efficient matrix-vector multiplication
    Zardoshti, Pantea
    Khunjush, Farshad
    Sarbazi-Azad, Hamid
    JOURNAL OF SUPERCOMPUTING, 2016, 72 (09): : 3366 - 3386
  • [40] Autotuning Runtime Specialization for Sparse Matrix-Vector Multiplication
    Yilmaz, Buse
    Aktemur, Baris
    Garzaran, Maria J.
    Kamin, Sam
    Kirac, Furkan
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2016, 13 (01)