Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures

Authors
Kaifang Zhang
Huayou Su
Yong Dou
Affiliation
[1] National University of Defense Technology, College of Computer
Keywords
Stencil computation; Parallelism optimization; Hybrid programming; NUMA
DOI
Not available
Abstract
Stencil computations within a single core or across the cores of an SMP node have been extensively investigated. However, the demand for higher HPC performance and the rapidly increasing number of cores in modern processors pose new challenges for program developers. These cores are typically organized as several NUMA nodes, which are characterized by remote memory access across nodes and uniform access to local memory within each node. In this paper, we conduct experiments with stencil computations on NUMA systems based on the two most typical processors, ARM and Intel Xeon E5. We adopt a hybrid programming approach that combines MPI and OpenMP to exploit parallelism both among NUMA nodes and within a NUMA node. The optimizations of the two selected 3D stencil computations involve four levels of parallelism: block decomposition across NUMA nodes and across processes, thread-level parallelism within a NUMA node, and data-level parallelism within a thread based on SIMD extensions. Experimental results show a maximum speedup of 7.27× over the pure OpenMP implementations on the ARM platform and 11.68× on the Intel platform.
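The layering described in the abstract can be illustrated with a minimal C sketch. It is not the authors' implementation: the function names `slab_bounds`, `idx`, and `stencil_sweep`, the 7-point averaging stencil, and the 1D slab split along z are all assumptions chosen for illustration. The slab split stands in for the block decomposition across processes; in an actual hybrid code each MPI rank would own one slab and exchange halo planes with its neighbours, which is omitted here. The OpenMP pragmas are standard and are ignored if the compiler is run without OpenMP support.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split n planes into nparts contiguous slabs and
 * return the half-open range [*begin, *end) owned by slab `part`.
 * In a hybrid code, `part` would typically be the MPI rank. */
static void slab_bounds(int n, int nparts, int part, int *begin, int *end)
{
    assert(nparts > 0 && part >= 0 && part < nparts);
    int base = n / nparts;
    int rem = n % nparts;               /* first `rem` slabs get one extra plane */
    *begin = part * base + (part < rem ? part : rem);
    *end = *begin + base + (part < rem ? 1 : 0);
}

/* Linear index into an nx * ny * nz array stored with x fastest. */
static inline size_t idx(int x, int y, int z, int nx, int ny)
{
    return ((size_t)z * (size_t)ny + (size_t)y) * (size_t)nx + (size_t)x;
}

/* One Jacobi-style sweep of an illustrative 7-point stencil over the
 * planes [z_begin, z_end). The caller must make sure that planes
 * z_begin - 1 and z_end exist in `in` (boundary or halo planes). */
static void stencil_sweep(const double *restrict in, double *restrict out,
                          int nx, int ny, int z_begin, int z_end)
{
    const size_t sx = (size_t)nx;            /* stride between y rows   */
    const size_t sxy = (size_t)nx * ny;      /* stride between z planes */

    /* Thread-level parallelism: threads share the (z, y) iteration space. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int z = z_begin; z < z_end; ++z) {
        for (int y = 1; y < ny - 1; ++y) {
            /* Data-level parallelism: unit-stride inner loop for SIMD. */
            #pragma omp simd
            for (int x = 1; x < nx - 1; ++x) {
                size_t c = idx(x, y, z, nx, ny);
                out[c] = (in[c] + in[c - 1] + in[c + 1]
                          + in[c - sx] + in[c + sx]
                          + in[c - sxy] + in[c + sxy]) / 7.0;
            }
        }
    }
}
```

A caller would compute its slab with `slab_bounds`, offset the range by one plane to skip the boundary, and call `stencil_sweep` on that range. Pinning each process to one NUMA node, for example through the MPI launcher's binding options, is what keeps a slab's memory local; that step depends on the launcher and is not shown.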
Pages: 13584-13600
Number of pages: 16