Automatic Code Generation and Optimization of Large-scale Stencil Computation on Many-core Processors

Cited by: 10

Authors:

Li, Mingzhen [1,2]

Liu, Yi [2]

Yang, Hailong [1,2]

Hu, Yongmin [2]

Sun, Qingxiao [2]

Chen, Bangduo [2]

You, Xin [2]

Liu, Xiaoyan [2]

Luan, Zhongzhi [2]

Qian, Depei [2]

Affiliations:

[1] State Key Laboratory of Software Development Environment, Beijing, People's Republic of China

[2] Beihang University, Beijing, People's Republic of China

Source:

50th International Conference on Parallel Processing (ICPP), 2021

Funding:

National Natural Science Foundation of China

Keywords:

Stencil; Domain Specific Language; Performance Optimization; Manycore Architecture

DOI:

10.1145/3472456.3473517

Chinese Library Classification (CLC) number:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

Stencil computation is an indispensable building block of many scientific applications and is widely used by numerical solvers of partial differential equations (PDEs). Because different stencils exhibit complex computation patterns and must run on diverse hardware targets (e.g., many-core processors), many domain-specific languages (DSLs) have been proposed to optimize stencil computation. However, existing stencil DSLs mostly focus on performance optimization for homogeneous many-core processors such as CPUs and GPUs, and fail to embrace emerging heterogeneous many-core processors such as Sunway. In addition, few of them support expressing stencils with multiple time dependencies and optimizing along both the spatial and temporal dimensions. Moreover, most stencil DSLs cannot generate code that runs efficiently at large scale, which limits their practical applicability. In this paper, we propose MSC, a new stencil DSL designed to express stencil computation in both the spatial and temporal dimensions. MSC generates high-performance stencil code for large-scale execution on emerging many-core processors. Specifically, we design several optimization primitives that improve parallelism and data locality, and a communication library for efficient halo exchange in large-scale execution. The experimental results show that MSC outperforms state-of-the-art stencil DSLs.
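Two ideas in the abstract, stencils with multiple time dependencies and halo exchange for large-scale execution, can be made concrete with a toy example. The sketch below is a hypothetical illustration, not MSC's DSL syntax or its generated code. It uses a 2D wave-equation update that reads two earlier time steps (u[t] and u[t-1]) and runs it twice: once on the whole grid, and once split into two row blocks that swap one-row halos after every step. The function names (`step`, `run_global`, `exchange_halos`, `run_decomposed`), the grid size, and the coefficient `c2` are assumptions chosen for illustration. In a real distributed run, halo rows would travel between processes as messages (for instance over MPI). Here they are plain list copies.

```python
# Hypothetical illustration only: NOT MSC syntax and NOT MSC-generated code.
# A 2D wave-equation stencil whose update reads two earlier time steps
# (u[t] and u[t-1]), run (a) on the whole grid and (b) split into two row
# blocks that exchange one-row halos after every step.
from typing import List

Grid = List[List[float]]


def step(u_prev: Grid, u_curr: Grid, c2: float) -> Grid:
    """Return u[t+1] = 2*u[t] - u[t-1] + c2 * laplacian(u[t]) on interior points.

    The first and last rows and columns are not updated (copied from u_curr):
    they are fixed boundaries or, in the decomposed run, halo rows refreshed later.
    """
    ny, nx = len(u_curr), len(u_curr[0])
    u_next = [row[:] for row in u_curr]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            lap = (u_curr[j - 1][i] + u_curr[j + 1][i]
                   + u_curr[j][i - 1] + u_curr[j][i + 1] - 4.0 * u_curr[j][i])
            u_next[j][i] = 2.0 * u_curr[j][i] - u_prev[j][i] + c2 * lap
    return u_next


def run_global(u0: Grid, steps: int, c2: float) -> Grid:
    """Single-domain reference run; u[-1] = u[0] means zero initial velocity."""
    u_prev, u_curr = [r[:] for r in u0], [r[:] for r in u0]
    for _ in range(steps):
        u_prev, u_curr = u_curr, step(u_prev, u_curr, c2)
    return u_curr


def exchange_halos(top: Grid, bottom: Grid) -> None:
    """Fill each block's halo row with its neighbour's adjacent owned row."""
    top[-1] = bottom[1][:]   # top halo    <- bottom's first owned row
    bottom[0] = top[-2][:]   # bottom halo <- top's last owned row


def run_decomposed(u0: Grid, steps: int, c2: float, mid: int) -> Grid:
    """Two-block run: top owns rows [0, mid), bottom owns rows [mid, ny)."""
    top_prev = [r[:] for r in u0[: mid + 1]]   # owned rows + halo row mid
    bot_prev = [r[:] for r in u0[mid - 1:]]    # halo row mid-1 + owned rows
    top_curr = [r[:] for r in top_prev]
    bot_curr = [r[:] for r in bot_prev]
    for _ in range(steps):
        top_next = step(top_prev, top_curr, c2)
        bot_next = step(bot_prev, bot_curr, c2)
        exchange_halos(top_next, bot_next)  # at scale: messages between processes
        top_prev, top_curr = top_curr, top_next
        bot_prev, bot_curr = bot_curr, bot_next
    return top_curr[:-1] + bot_curr[1:]       # drop halos, stitch owned rows


if __name__ == "__main__":
    n = 8
    u0 = [[1.0 if (j, i) == (4, 4) else 0.0 for i in range(n)] for j in range(n)]
    print(run_global(u0, 5, 0.25) == run_decomposed(u0, 5, 0.25, 4))  # True
```

Both runs perform the same floating-point operations in the same order, so the decomposed result matches the single-domain result exactly. Note that `step` must keep both u[t] and u[t-1] alive, which is the kind of multi-time dependency the abstract refers to. The parallelism and data-locality optimizations the abstract mentions are deliberately omitted from this sketch.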


Pages: 12
