Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures

Cited by: 0

Authors:

Kaifang Zhang

Huayou Su

Yong Dou

Affiliation:

[1] National University of Defense Technology, College of Computer

Source:

The Journal of Supercomputing | 2021, Vol. 77, No. 11

Keywords:

Stencil computation; Parallelism optimization; Hybrid programming; NUMA;

DOI:

Not available

CLC number:

Subject classification number:

Abstract:

Stencil computations within a single core or across the cores of an SMP node have been extensively investigated. However, the demand for higher HPC performance and the rapidly increasing number of cores in modern processors pose new challenges for program developers. These cores are typically organized as several NUMA nodes, which are characterized by remote memory across nodes and local memory with uniform memory access within each node. In this paper, we conduct experiments on stencil computations on NUMA systems based on the two most typical processors, ARM and Intel Xeon E5. We leverage a hybrid programming approach that combines MPI and OpenMP to exploit the potential benefits among NUMA nodes and within a NUMA node. Optimizations of the two selected 3D stencil computations involve four-level parallelism: block decomposition for NUMA nodes and processes, thread-level parallelism within a NUMA node, and data-level parallelism within a thread based on SIMD extensions. Experimental results show that we obtain a maximum speedup of 7.27× compared to the pure OpenMP implementations on the ARM platform and 11.68× on the Intel platform.
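
The layered scheme described in the abstract can be illustrated with a minimal C sketch. This is not the authors' implementation: the abstract does not name the two stencils, so a 7-point 3D stencil is assumed here, and the names `block_range`, `stencil7_sweep`, and `IDX` are hypothetical. The MPI layer (one process per NUMA node, with halo exchange between neighbouring blocks) is left out; only the block decomposition arithmetic, the OpenMP thread-level loop, and the SIMD inner loop are shown.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index into an nz x ny x nx grid (hypothetical helper macro). */
#define IDX(i, j, k, ny, nx) (((size_t)(i) * (ny) + (j)) * (nx) + (k))

/* Block decomposition (assumed 1D along z): split n planes among nblocks
 * blocks, e.g. one block per process or NUMA node. Block b owns the
 * half-open range [*lo, *hi); the first n % nblocks blocks get one extra. */
static void block_range(int n, int nblocks, int b, int *lo, int *hi)
{
    assert(nblocks > 0 && b >= 0 && b < nblocks);
    int base = n / nblocks;
    int rem = n % nblocks;
    *lo = b * base + (b < rem ? b : rem);
    *hi = *lo + base + (b < rem ? 1 : 0);
}

/* One sweep of an assumed 7-point stencil over planes [zlo, zhi), clamped
 * to the grid interior. Thread-level parallelism comes from the OpenMP
 * "parallel for"; data-level parallelism from "omp simd" on the
 * unit-stride k loop. Without OpenMP enabled the pragmas are ignored and
 * the code runs serially with the same result. */
static void stencil7_sweep(const double *in, double *out,
                           int nz, int ny, int nx,
                           int zlo, int zhi, double c0, double c1)
{
    if (zlo < 1) zlo = 1;
    if (zhi > nz - 1) zhi = nz - 1;

    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = zlo; i < zhi; ++i) {
        for (int j = 1; j < ny - 1; ++j) {
            #pragma omp simd
            for (int k = 1; k < nx - 1; ++k) {
                out[IDX(i, j, k, ny, nx)] =
                    c0 * in[IDX(i, j, k, ny, nx)] +
                    c1 * (in[IDX(i - 1, j, k, ny, nx)] +
                          in[IDX(i + 1, j, k, ny, nx)] +
                          in[IDX(i, j - 1, k, ny, nx)] +
                          in[IDX(i, j + 1, k, ny, nx)] +
                          in[IDX(i, j, k - 1, ny, nx)] +
                          in[IDX(i, j, k + 1, ny, nx)]);
            }
        }
    }
}
```

In a hybrid setup along these lines, each MPI rank would call `block_range` with its own rank as `b`, bind itself to one NUMA node, and call `stencil7_sweep` on its range after exchanging boundary planes with its neighbours.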


Pages: 13584-13600

Page count: 16
