FitenBLAS: High performance BLAS for a massively multithreaded FT1000 processor

Cited by: 0

Authors:

Chi, Li-Hua [1]

Liu, Jie [1]

Yan, Yi-Hui [1]

Xie, Lin-Chuan [1]

Gan, Xin-Biao [1]

Hu, Qin-Feng [1]

Jiang, Jie [1]

Li, Sheng-Guo [1]

Affiliation:

[1] National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan 410073, China

Source:

Hunan Daxue Xuebao/Journal of Hunan University Natural Sciences | 2015 / Vol. 42 / No. 04

DOI: not available

Abstract:

The BLAS library is a fundamental linear algebra library and plays an important role in many large-scale scientific applications. This paper presents FitenBLAS, a linear algebra library developed for the massively multithreaded FT1000 processor. Based on the memory hierarchy and the number of available registers, multilevel loop unrolling methods were developed for vector-vector, matrix-vector, and matrix-matrix operations. The FitenBLAS code was optimized with instruction layout and data prefetching. A method that avoids redundant packing was proposed for parallel matrix-matrix multiplication, and the corresponding parallel code was developed. The kernel matrix-matrix multiplication code was optimized with instruction layout, overlapping of data access with computation, and data blocking. The other BLAS3 subroutines were built on top of the matrix multiplication code. The kernel codes of FitenBLAS were written in assembly language. The key matrix multiplication subroutine reaches 6.91 Gflop/s, nearly 86.4% of the peak performance of the FT1000. © 2015, Hunan University. All rights reserved.
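The abstract names three techniques for the matrix multiplication kernel: data blocking, packing, and loop unrolling. The sketch below illustrates how these commonly fit together in a blocked GEMM. It is not the FitenBLAS implementation, which the abstract says is written in assembly and whose packing scheme is not described here. The function names (`pack_block`, `micro_kernel`, `blocked_gemm`), the block sizes, and the 2-way unroll factor are all assumptions chosen for illustration. Each panel of B is packed once and reused across every row block of A, which is one plausible reading of "avoiding redundant packing".

```python
def pack_block(M, r0, c0, rows, cols):
    """Copy a rows x cols block of M, starting at (r0, c0), into a flat row-major list."""
    return [M[r0 + i][c0 + j] for i in range(rows) for j in range(cols)]


def micro_kernel(Ap, Bp, C, i0, j0, mc, nc, kc):
    """Accumulate the product of packed blocks Ap (mc x kc) and Bp (kc x nc) into C."""
    for i in range(mc):
        a_row = i * kc
        for j in range(nc):
            acc = C[i0 + i][j0 + j]
            p = 0
            # Inner product loop unrolled by 2 (hypothetical unroll factor).
            while p + 1 < kc:
                acc += (Ap[a_row + p] * Bp[p * nc + j]
                        + Ap[a_row + p + 1] * Bp[(p + 1) * nc + j])
                p += 2
            if p < kc:  # remainder iteration when kc is odd
                acc += Ap[a_row + p] * Bp[p * nc + j]
            C[i0 + i][j0 + j] = acc


def blocked_gemm(A, B, mb=4, nb=4, kb=4):
    """Return (C, b_pack_count) where C = A * B, computed block by block.

    b_pack_count records how many times a B panel was packed, to show
    that each panel is packed once rather than once per row block of A.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    b_pack_count = 0
    for p0 in range(0, k, kb):
        kc = min(kb, k - p0)
        for j0 in range(0, n, nb):
            nc = min(nb, n - j0)
            Bp = pack_block(B, p0, j0, kc, nc)  # packed once, reused below
            b_pack_count += 1
            for i0 in range(0, m, mb):
                mc = min(mb, m - i0)
                Ap = pack_block(A, i0, p0, mc, kc)
                micro_kernel(Ap, Bp, C, i0, j0, mc, nc, kc)
    return C, b_pack_count
```

Calling `blocked_gemm(A, B)` on two nested-list matrices returns the product together with the number of B panels packed. In a tuned library the micro-kernel would keep its accumulators in registers and the block sizes would be derived from the cache and register-file sizes of the target processor.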

Pages: 100-106

8 records in total

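[1] FitenBLAS: A high-performance linear algebra library for the FT1000 microprocessor
Chi, Li-Hua
Liu, Jie
Yan, Yi-Hui
Xie, Lin-Chuan
Gan, Xin-Biao
Hu, Qin-Feng
Jiang, Jie
Li, Sheng-Guo
Journal of Hunan University (Natural Sciences), 2015, 42 (04) : 100 - 106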
[2] FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance
Zhai, Yujia
Giem, Elisabeth
Fan, Quan
Zhao, Kai
Liu, Jinyang
Chen, Zizhong
PROCEEDINGS OF THE 2021 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ICS 2021, 2021, : 127 - 138
[3] Performance evaluation of low level multithreaded BLAS kernels on intel processor based cc-NUMA systems
Nishida, A
Oyanagi, Y
HIGH PERFORMANCE COMPUTING, 2003, 2858 : 500 - 510
[4] FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs
Zhai, Yujia
Giem, Elisabeth
Zhao, Kai
Liu, Jinyang
Huang, Jiajun
Wong, Bryan M.
Shelton, Christian R.
Chen, Zizhong
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (12) : 3207 - 3223
[5] High Performance Image Processing on a Massively Parallel Processor Array
Osorio, Roberto R.
Diaz-Resco, Cesar
Bruguera, Javier D.
PROCEEDINGS OF THE 2009 12TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN, ARCHITECTURES, METHODS AND TOOLS, 2009, : 233 - 236
[6] High-performance packet classification algorithm for multithreaded IXP network processor
Liu, Duo
Chen, Zheng
Hua, Bei
Yu, Nenghai
Tang, Xinan
ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, 2008, 7 (02)
[7] Design of a processor element for a high performance massively parallel SIMD system
Beal, D
Lambrinoudakis, C
INTERNATIONAL JOURNAL OF HIGH SPEED COMPUTING, 1995, 7 (03) : 365 - 390
[8] E6500: FREESCALE'S LOW-POWER, HIGH-PERFORMANCE MULTITHREADED EMBEDDED PROCESSOR
Burgess, David
Gieske, Edmund
Holt, James
Hoy, Thomas
Whisenhunt, Gary
IEEE MICRO, 2012, 32 (05) : 26 - 36
