Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures

Cited by: 0

Authors:

Kaifang Zhang

Huayou Su

Yong Dou

Affiliation:

[1] National University of Defense Technology, College of Computer

Source:

The Journal of Supercomputing | 2021 / Vol. 77

Keywords:

Stencil computation; Parallelism optimization; Hybrid programming; NUMA

DOI:

Not available

CLC number:

Subject classification code:

Abstract:

Stencil computations within a single core or multicores of an SMP node have been over-investigated. However, the demands on HPC's higher performance and the rapidly increasing number of cores in modern processors pose new challenges for program developers. These cores are typically organized as several NUMA nodes, which are characterized by remote memory across nodes and local memory with uniform memory access within each node. In this paper, we conducted experiments of stencil computations on NUMA systems based on the two most typical processors, ARM and Intel Xeon E5. We leverage a hybrid programming approach by combining MPI and OpenMP to exploit the potential benefits among NUMA nodes and within a NUMA node. Optimizations of the two selected 3D stencil computations involve four-level parallelism: block decomposition for NUMA nodes and processes, thread-level parallelism within a NUMA node, and data-level parallelism within a thread based on SIMD extension. Experimental results show that we obtain a maximum speedup of 7.27× compared to the pure OpenMP implementations on the ARM platform and 11.68× on the Intel platform.
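The layered scheme described in the abstract can be illustrated with a minimal C sketch. It is not the authors' implementation: the 7-point averaging stencil, the function names `slab_range` and `stencil_slab`, and the slab decomposition along z are all assumptions made for illustration. The block-decomposition level is only emulated here by computing which z-planes a given rank would own; a real hybrid code would give each slab to an MPI process pinned to a NUMA node and exchange halo planes between neighbors. Thread-level and data-level parallelism are expressed with standard OpenMP directives, which a compiler ignores when OpenMP is disabled.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into an nx*ny*nz grid, with x as the fastest dimension. */
static inline size_t idx(int x, int y, int z, int nx, int ny) {
    return ((size_t)z * ny + y) * nx + x;
}

/* Block decomposition (emulated): split the interior z-planes [1, nz-1)
 * into `parts` contiguous slabs and return the half-open range [z0, z1)
 * owned by `rank`. In a hybrid code each slab would belong to one MPI
 * process; that mapping is an assumption of this sketch. */
static void slab_range(int nz, int parts, int rank, int *z0, int *z1) {
    assert(nz >= 3 && parts >= 1 && rank >= 0 && rank < parts);
    int interior = nz - 2;
    int base = interior / parts;
    int rem = interior % parts;
    *z0 = 1 + rank * base + (rank < rem ? rank : rem);
    *z1 = *z0 + base + (rank < rem ? 1 : 0);
}

/* One sweep of a hypothetical 7-point averaging stencil over one slab.
 * Threads share the (z, y) iteration space; the innermost x loop is
 * marked for SIMD vectorization. */
static void stencil_slab(const double *restrict in, double *restrict out,
                         int nx, int ny, int nz, int z0, int z1) {
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    assert(z0 >= 1 && z1 <= nz - 1);
    const size_t sy = (size_t)nx;       /* stride between y rows   */
    const size_t sz = (size_t)nx * ny;  /* stride between z planes */

    #pragma omp parallel for collapse(2) schedule(static)
    for (int z = z0; z < z1; ++z) {
        for (int y = 1; y < ny - 1; ++y) {
            #pragma omp simd
            for (int x = 1; x < nx - 1; ++x) {
                size_t c = idx(x, y, z, nx, ny);
                out[c] = (in[c] + in[c - 1] + in[c + 1] +
                          in[c - sy] + in[c + sy] +
                          in[c - sz] + in[c + sz]) / 7.0;
            }
        }
    }
}
```

A caller would loop over ranks (or run one rank per process), call `slab_range` to find its planes, and then call `stencil_slab` on them. Splitting along z keeps each slab contiguous in memory, which is one common choice when the goal is to keep a process's data in its local NUMA memory.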


Pages: 13584 - 13600

Page count: 16

