Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor

Cited by: 35

Authors:

Kurzak, Jakub [1]

Alvaro, Wesley [1]

Dongarra, Jack [1, 2, 3, 4]

Affiliations:

[1] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA

[2] Oak Ridge Natl Lab, Div Math & Comp Sci, Oak Ridge, TN USA

[3] Univ Manchester, Sch Math, Manchester, England

[4] Univ Manchester, Sch Comp Sci, Manchester, England

Source:

PARALLEL COMPUTING | 2009 / Vol. 35 / Issue 03

Keywords:

Instruction level parallelism; Single Instruction Multiple Data; Synergistic Processing Element; Loop optimizations; Vectorization; LINEAR-EQUATIONS; SOLVING SYSTEMS;

DOI:

10.1016/j.parco.2008.12.010

Chinese Library Classification number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance, aside from special-purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented implementing the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures. © 2009 Elsevier B.V. All rights reserved.
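For orientation, the two operations named in the abstract can be written as plain scalar loops. The sketch below is only a reference formulation in portable C, not the hand-tuned SIMD kernels the paper presents. The function names, the row-major layout, and the helper `fraction_of_peak` are illustrative assumptions. The peak estimate further assumes a 3.2 GHz clock, four single-precision lanes per vector, and one fused multiply-add (two flops) per lane per cycle, none of which is stated in the abstract.

```c
#include <stddef.h>

#define N 64  /* tile size from the abstract: 64 x 64 elements */

/* Reference form of C = C - A * B^T (row-major layout is an assumption). */
void gemm_nt_sub(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            /* Element (k, j) of B^T is element (j, k) of B. */
            for (size_t k = 0; k < N; k++)
                acc += a[i * N + k] * b[j * N + k];
            c[i * N + j] -= acc;
        }
    }
}

/* Reference form of C = C - A * B (same layout assumption). */
void gemm_nn_sub(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t k = 0; k < N; k++) {
            float aik = a[i * N + k];
            /* Innermost loop walks rows of B and C contiguously. */
            for (size_t j = 0; j < N; j++)
                c[i * N + j] -= aik * b[k * N + j];
        }
    }
}

/* Hypothetical helper: achieved rate over an assumed peak of
 * clock (GHz) * SIMD lanes * 2 flops per fused multiply-add. */
double fraction_of_peak(double gflops, double clock_ghz, int simd_lanes)
{
    double peak = clock_ghz * (double)simd_lanes * 2.0;
    return gflops / peak;
}
```

Under those assumptions the peak works out to 25.6 Gflop/s, so `fraction_of_peak(25.55, 3.2, 4)` is roughly 0.998, which agrees with the 99.80% quoted in the abstract.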

Pages: 138-150

Page count: 13
