Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures

Cited by: 0
Authors
Kaifang Zhang
Huayou Su
Yong Dou
Affiliations
[1] National University of Defense Technology, College of Computer
Keywords
Stencil computation; Parallelism optimization; Hybrid programming; NUMA
DOI
Not available
Abstract
Stencil computations within a single core or across the cores of an SMP node have been extensively investigated. However, the demand for higher HPC performance and the rapidly increasing number of cores in modern processors pose new challenges for program developers. These cores are typically organized as several NUMA nodes, which are characterized by remote memory across nodes and local memory with uniform memory access within each node. In this paper, we conduct experiments with stencil computations on NUMA systems based on two typical processors, ARM and Intel Xeon E5. We leverage a hybrid programming approach that combines MPI and OpenMP to exploit the potential benefits both among NUMA nodes and within a NUMA node. Optimizations of the two selected 3D stencil computations involve four-level parallelism: block decomposition for NUMA nodes and processes, thread-level parallelism within a NUMA node, and data-level parallelism within a thread based on SIMD extensions. Experimental results show a maximum speedup of 7.27× over the pure OpenMP implementations on the ARM platform and 11.68× on the Intel platform.
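The levels of parallelism named in the abstract can be illustrated with a small sketch. This is not the authors' code: the abstract does not say which two stencils were used, so a 7-point Jacobi stencil is assumed, and the grid size `N`, the slab count `SLABS`, and the function names `sweep_slab` and `sweep_grid` are hypothetical. The MPI level is only imitated by a serial loop over z-slabs; in a real hybrid program each slab would presumably belong to one MPI process bound to a NUMA node, with boundary planes exchanged between neighbours.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes chosen for illustration; not taken from the paper. */
#define N 32      /* grid points per dimension                          */
#define SLABS 4   /* stand-in for the number of NUMA nodes / processes  */

/* Row-major index into an N x N x N grid, x being the unit-stride axis. */
#define IDX(z, y, x) (((size_t)(z) * N + (size_t)(y)) * N + (size_t)(x))

/* One Jacobi sweep of an assumed 7-point stencil over planes [z0, z1). */
static void sweep_slab(const double *in, double *out, int z0, int z1)
{
    /* The slab must stay inside the interior so neighbours exist. */
    assert(z0 >= 1 && z1 <= N - 1);

    /* Thread level: (z, y) iterations are divided among OpenMP threads. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int z = z0; z < z1; ++z) {
        for (int y = 1; y < N - 1; ++y) {
            /* Data level: the unit-stride x loop is offered to SIMD. */
            #pragma omp simd
            for (int x = 1; x < N - 1; ++x) {
                out[IDX(z, y, x)] =
                    (in[IDX(z, y, x)]
                     + in[IDX(z, y, x - 1)] + in[IDX(z, y, x + 1)]
                     + in[IDX(z, y - 1, x)] + in[IDX(z, y + 1, x)]
                     + in[IDX(z - 1, y, x)] + in[IDX(z + 1, y, x)]) / 7.0;
            }
        }
    }
}

/* Block level: interior planes 1..N-2 are split into SLABS contiguous
 * z-slabs. The serial loop here only imitates what separate processes
 * would do concurrently on their own slabs. */
static void sweep_grid(const double *in, double *out)
{
    const int interior = N - 2;
    for (int s = 0; s < SLABS; ++s) {
        int z0 = 1 + (s * interior) / SLABS;
        int z1 = 1 + ((s + 1) * interior) / SLABS;
        sweep_slab(in, out, z0, z1);
    }
}
```

Built without OpenMP support, the pragmas are ignored and the sweep runs serially with the same result, which makes the sketch easy to check before enabling threads and vectorization.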
Pages: 13584 - 13600
Page count: 16
Related papers
35 in total
  • [1] Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures
    Zhang, Kaifang
    Su, Huayou
    Dou, Yong
    [J]. JOURNAL OF SUPERCOMPUTING, 2021, 77 (11): 13584 - 13600
  • [2] Automatically Optimizing Stencil Computations on Many-Core NUMA Architectures
    Lin, Pei-Hung
    Yi, Qing
    Quinlan, Daniel
    Liao, Chunhua
    Yan, Yongqing
    [J]. LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, LCPC 2016, 2017, 10136 : 137 - 152
  • [3] Islands-of-Cores Approach for Harnessing SMP/NUMA Architectures in Heterogeneous Stencil Computations
    Szustak, Lukasz
    Wyrzykowski, Roman
    Jakl, Ondrej
    [J]. PARALLEL COMPUTING TECHNOLOGIES (PACT 2017), 2017, 10421 : 351 - 364
  • [4] Optimization and Performance Modeling of Stencil Computations on ARM Architectures
    Zhang, Kaifang
    Su, Huayou
    Zhang, Peng
    Dou, Yong
    [J]. Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020, 2020: 113 - 121
  • [5] Tiling Stencil Computations to Maximize Parallelism
    Bandishti, Vinayaka
    Pananilath, Irshad
    Bondhugula, Uday
    [J]. 2012 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2012
  • [6] Data Partitioning Strategies for Stencil Computations on NUMA Systems
    Feinbube, Frank
    Plauth, Max
    Knaust, Marius
    Polze, Andreas
    [J]. EURO-PAR 2017: PARALLEL PROCESSING WORKSHOPS, 2018, 10659 : 597 - 609
  • [7] Modeling Stencil Computations on Modern HPC Architectures
    de la Cruz, Raul
    Araya-Polo, Mauricio
    [J]. HIGH PERFORMANCE COMPUTING SYSTEMS: PERFORMANCE MODELING, BENCHMARKING, AND SIMULATION, 2015, 8966 : 149 - 171
  • [8] NUMA Aware Iterative Stencil Computations on Many-Core Systems
    Shaheen, Mohammed
    Strzodka, Robert
    [J]. 2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2012: 461 - 473
  • [9] Diamond Tiling: Tiling Techniques to Maximize Parallelism for Stencil Computations
    Bondhugula, Uday
    Bandishti, Vinayaka
    Pananilath, Irshad
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (05) : 1285 - 1298
  • [10] Unleashing the performance of ccNUMA multiprocessor architectures in heterogeneous stencil computations
    Szustak, Lukasz
    Halbiniak, Kamil
    Wyrzykowski, Roman
    Jakl, Ondrej
    [J]. JOURNAL OF SUPERCOMPUTING, 2019, 75 (12): 7765 - 7777