Towards Highly Efficient DGEMM on the Emerging SW26010 Many-core Processor

Cited by: 31
Authors
Jiang, Lijuan [1, 2]
Yang, Chao [1, 3]
Ao, Yulong [1, 2]
Yin, Wanwang [4]
Ma, Wenjing [1, 3]
Sun, Qiao [1]
Liu, Fangfang [1, 2]
Lin, Rongfen [4]
Zhang, Peng [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Chinese Acad Sci, State Key Lab Comp Sci, Beijing, Peoples R China
[4] Natl Res Ctr Parallel Comp Engn & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DGEMM; dense linear algebra; SW26010; processor; many-core architecture; Sunway TaihuLight; LINPACK benchmark
DOI
10.1109/ICPP.2017.51
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer. We propose a three-level blocking algorithm to orchestrate data on the memory hierarchy and expose parallelism on different hardware levels, and design a collective data sharing scheme by using the register communication mechanism to exchange data efficiently among different cores. On top of those, further optimizations are done based on a data-thread mapping method for efficient data distribution, a double buffering scheme for asynchronous DMA data transfer, and an instruction scheduling method for maximizing the pipeline usage. Experiment results show that the proposed DGEMM implementation can fully exploit the unique hardware features provided by SW26010 and can sustain up to 95% of the peak performance.
Pages: 422-431
Number of pages: 10