FitenBLAS: High performance BLAS for a massively multithreaded FT1000 processor

Authors
Chi, Li-Hua [1]
Liu, Jie [1]
Yan, Yi-Hui [1]
Xie, Lin-Chuan [1]
Gan, Xin-Biao [1]
Hu, Qin-Feng [1]
Jiang, Jie [1]
Li, Sheng-Guo [1]
Affiliation
[1] National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan 410073, China
Abstract
BLAS is the fundamental linear algebra library and plays an important role in many large-scale scientific applications. This paper develops FitenBLAS, a linear algebra library for the massively multithreaded FT1000 processor. Multilevel loop unrolling methods, based on the hierarchical storage system and the number of registers, were developed for vector-vector, matrix-vector, and matrix-matrix operations. The FitenBLAS code was optimized with instruction layout and data prefetching. A method that avoids redundant packing was proposed for parallel matrix-matrix multiplication, and the parallel code was developed. The matrix-matrix multiplication kernel was optimized with instruction layout, overlapping of data access with computation, and data blocking. The other BLAS3 subroutines were built on the matrix multiplication code. The FitenBLAS kernels were written in assembly language. The key matrix multiplication subroutine reaches 6.91 Gflop/s, nearly 86.4% of the peak performance of the FT1000. © 2015, Hunan University. All rights reserved.
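The abstract names three techniques for matrix-matrix multiplication: data blocking, packing, and loop unrolling matched to the register count. The C sketch below illustrates those ideas in general terms only. It is not the FitenBLAS code, which the abstract says is written in assembly, and it does not reproduce the paper's redundancy-avoiding packing scheme, which the abstract does not describe. The function names (`gemm_naive`, `pack_b`, `gemm_blocked`), the block sizes `KC` and `NC`, and the unroll factor of 4 are all assumptions chosen for illustration, not values tuned for the FT1000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { KC = 64, NC = 64 };  /* illustrative block sizes, not FT1000-tuned */

/* Reference triple loop: C += A * B, all matrices row-major. */
void gemm_naive(int m, int n, int k,
                const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[(size_t)i * n + j] += A[(size_t)i * k + p] * B[(size_t)p * n + j];
}

/* Copy a kc x nc block of B (leading dimension ldb) into contiguous storage. */
void pack_b(int kc, int nc, const double *B, int ldb, double *Bp)
{
    for (int p = 0; p < kc; p++)
        for (int j = 0; j < nc; j++)
            Bp[(size_t)p * nc + j] = B[(size_t)p * ldb + j];
}

/* Blocked C += A * B. Returns 0 on success, -1 if allocation fails. */
int gemm_blocked(int m, int n, int k,
                 const double *A, const double *B, double *C)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    double *Bp = malloc(sizeof(double) * KC * NC);
    if (!Bp) return -1;

    for (int jc = 0; jc < n; jc += NC) {
        int nc = (n - jc < NC) ? n - jc : NC;
        for (int pc = 0; pc < k; pc += KC) {
            int kc = (k - pc < KC) ? k - pc : KC;
            /* Pack once per (jc, pc) block; every row of A below reuses it. */
            pack_b(kc, nc, &B[(size_t)pc * n + jc], n, Bp);

            for (int i = 0; i < m; i++) {
                const double *a = &A[(size_t)i * k + pc];
                double *c = &C[(size_t)i * n + jc];
                int j = 0;
                /* Unroll by 4 over j: four accumulators can stay in registers. */
                for (; j + 4 <= nc; j += 4) {
                    double c0 = c[j], c1 = c[j + 1], c2 = c[j + 2], c3 = c[j + 3];
                    for (int p = 0; p < kc; p++) {
                        const double *b = &Bp[(size_t)p * nc + j];
                        double ap = a[p];
                        c0 += ap * b[0];
                        c1 += ap * b[1];
                        c2 += ap * b[2];
                        c3 += ap * b[3];
                    }
                    c[j] = c0; c[j + 1] = c1; c[j + 2] = c2; c[j + 3] = c3;
                }
                /* Remainder columns when nc is not a multiple of 4. */
                for (; j < nc; j++) {
                    double acc = c[j];
                    for (int p = 0; p < kc; p++)
                        acc += a[p] * Bp[(size_t)p * nc + j];
                    c[j] = acc;
                }
            }
        }
    }
    free(Bp);
    return 0;
}
```

Calling `gemm_blocked(m, n, k, A, B, C)` on row-major arrays accumulates the product into `C`, and `gemm_naive` serves as a reference for checking it. Packing a block of `B` once and reusing it for every row of `A` is one simple way to limit repeated copying; a production kernel would also pack `A`, prefetch, and schedule instructions by hand.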
Pages: 100-106
Published in: Journal of Hunan University (Natural Sciences) (湖南大学学报(自然科学版)), 2015, 42 (04): 100-106, under the Chinese title "FitenBLAS:面向FT1000微处理器的高性能线性代数库"