Towards Highly Efficient DGEMM on the Emerging SW26010 Many-core Processor

Cited by: 31

Authors:

Jiang, Lijuan [1,2]

Yang, Chao [1,3]

Ao, Yulong [1,2]

Yin, Wanwang [4]

Ma, Wenjing [1,3]

Sun, Qiao [1]

Liu, Fangfang [1,2]

Lin, Rongfen [4]

Zhang, Peng [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Software, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Beijing, Peoples R China

[3] Chinese Acad Sci, State Key Lab Comp Sci, Beijing, Peoples R China

[4] Natl Res Ctr Parallel Comp Engn & Technol, Beijing, Peoples R China

Source:

2017 46TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP) | 2017

Funding:

National Natural Science Foundation of China;

Keywords:

DGEMM; dense linear algebra; SW26010; processor; many-core architecture; Sunway TaihuLight; LINPACK BENCHMARK;

DOI:

10.1109/ICPP.2017.51

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer. We propose a three-level blocking algorithm to orchestrate data on the memory hierarchy and expose parallelism on different hardware levels, and design a collective data sharing scheme by using the register communication mechanism to exchange data efficiently among different cores. On top of those, further optimizations are done based on a data-thread mapping method for efficient data distribution, a double buffering scheme for asynchronous DMA data transfer, and an instruction scheduling method for maximizing the pipeline usage. Experiment results show that the proposed DGEMM implementation can fully exploit the unique hardware features provided by SW26010 and can sustain up to 95% of the peak performance.
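
As a rough illustration of the blocking idea the abstract mentions, the sketch below computes the standard DGEMM update C = alpha * A * B + beta * C one tile at a time. It is a generic single-level tiling in plain Python, not the paper's three-level SW26010 algorithm; the function name `blocked_dgemm`, the `block` parameter, and the list-of-lists matrix layout are assumptions made only for this example, and the matrices are assumed to be non-empty.

```python
def blocked_dgemm(A, B, C, alpha=1.0, beta=1.0, block=2):
    """Return alpha * A @ B + beta * C, computed tile by tile.

    A is m x k, B is k x n, C is m x n, each a non-empty list of lists.
    `block` is a hypothetical tile edge length chosen for illustration.
    """
    m, k, n = len(A), len(B), len(B[0])
    # Scale C once up front so the tile loops only accumulate products.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, block):            # tile rows of C
        for j0 in range(0, n, block):        # tile columns of C
            for p0 in range(0, k, block):    # tile the shared dimension
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a = alpha * A[i][p]
                        for j in range(j0, min(j0 + block, n)):
                            out[i][j] += a * B[p][j]
    return out
```

On real hardware the tile size would be chosen so that the working tiles fit in a given level of the memory hierarchy; here it only controls the loop structure, and the result is the same for any positive `block`.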


Pages: 422-431

Page count: 10

50 records in total

[1] Benchmarking SW26010 Many-core Processor
Xu, Zhigeng
Lin, James
Matsuoka, Satoshi
2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2017, : 743 - 752
[2] Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor
Jiang, Lijuan
Yang, Chao
Ma, Wenjing
ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2020, 17 (01)
[3] UNAT: UNstructured Acceleration Toolkit on SW26010 many-core processor
Liu, Hongbin
Ren, Hu
Gu, Hanfeng
Gao, Fei
Yang, Guangwen
ENGINEERING COMPUTATIONS, 2020, 37 (09) : 3187 - 3208
[4] Runtime Adaptive Matrix Multiplication for the SW26010 Many-Core Processor
Wu, Zheng
Li, Mingfan
Chi, Mengxian
Xu, Le
An, Hong
IEEE ACCESS, 2020, 8 : 156915 - 156928
[5] Efficient Implementation of Multilevel Fast Multipole Algorithm on SW26010 Many-core Processor
He, Wei-Jia
Yang, Ming-Lin
Sheng, Xin-Qing
2020 IEEE MTT-S INTERNATIONAL CONFERENCE ON NUMERICAL ELECTROMAGNETIC AND MULTIPHYSICS MODELING AND OPTIMIZATION (NEMO 2020), 2020,
[6] Enabling Highly Efficient k-Means Computations on the SW26010 Many-Core Processor of Sunway TaihuLight
Li, Min
Yang, Chao
Sun, Qiao
Ma, Wen-Jing
Cao, Wen-Long
Ao, Yu-Long
JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2019, 34 (01) : 77 - 93
[7] Enabling Highly Efficient k-Means Computations on the SW26010 Many-Core Processor of Sunway TaihuLight
Min Li
Chao Yang
Qiao Sun
Wen-Jing Ma
Wen-Long Cao
Yu-Long Ao
Journal of Computer Science and Technology, 2019, 34 : 77 - 93
[8] SW-LZMA: Parallel Implementation of LZMA Based on SW26010 Many-Core Processor
Li, Bingzheng
Xu, Jinchen
Liu, Zijing
WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2021, 2021
[9] Efficient parallelization of multilevel fast multipole algorithm for electromagnetic simulation on many-core SW26010 processor
Wei-Jia He
Ming-Lin Yang
Wu Wang
Xin-Qing Sheng
The Journal of Supercomputing, 2021, 77 : 1502 - 1516
[10] swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor
Gao, Wei
Fang, Jiarui
Zhao, Wenlai
Yang, Jinzhe
Wang, Long
Gan, Lin
Fu, Haohuan
Yang, Guangwen
PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019,
