Shared-memory parallelization technology of unstructured CFD solver for multi-core CPU/many-core GPU architecture

Cited by: 0

Authors:

Zhang J. [1,2]
Li R. [2]
Deng L. [2]
Dai Z. [2]
Liu J. [1]
Xu C. [1]

Affiliations:

[1] National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha

[2] Computational Aerodynamic Institute, China Aerodynamic Research and Development Center, Mianyang

Source:

Hangkong Xuebao/Acta Aeronautica et Astronautica Sinica | 2024, Vol. 45, No. 07

Keywords:

CFD; GPU; memory access optimization; shared memory parallelization; unstructured grid

DOI:

10.7527/S1000-6893.2023.28888

Abstract:

Shared-memory parallelization of unstructured CFD on modern high-performance computer architectures is key to improving floating-point efficiency and to enabling large-scale fluid simulation applications. However, because unstructured CFD computation involves complex topological relationships, poor data locality, and data write conflicts, parallelizing traditional algorithms in shared memory so that they efficiently exploit the hardware capabilities of multi-core CPUs and many-core GPUs has become a significant challenge. Starting from industrial-level unstructured CFD software, a variety of shared-memory parallel algorithms are designed and implemented on the basis of an in-depth analysis of computing behavior and memory access patterns, and data locality optimization techniques such as grid reordering, loop fusion, and multi-level memory access are used to further improve performance. Specifically, a comprehensive study is conducted on two parallel modes, loop-based and task-based, for multi-core CPU architectures. An innovative reduction parallel strategy based on a multi-level memory access optimization method is proposed for the many-core GPU architecture. All the implemented parallel methods and optimization techniques are analyzed and evaluated in depth using the M6 wing and CHN-T1 airplane test cases. The results show that the division-and-replication parallel strategy performs best on the CPU platform. Cuthill-McKee grid renumbering and loop fusion, both used to optimize memory access, each improve performance by 10%. On GPU platforms, the proposed reduction strategy combined with multi-level memory access optimization yields a significant acceleration: for the hot-spot subroutine with data races, the speed-up can be further improved by a factor of 3, and the overall speed-up can reach 127. © 2024 Chinese Society of Astronautics. All rights reserved.
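The data write conflict mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the four-cell mesh, and the flux values are hypothetical, and the cell-based gather shown here is only one generic way to avoid the conflict, not the division-and-replication or GPU reduction strategy evaluated in the paper. In a face-based loop, two faces that share a cell update the same residual entry, so running the loop over faces in parallel would race; regrouping the work by cell makes each iteration write to exactly one entry.

```python
def scatter_face_loop(n_cells, faces, flux):
    """Face-based accumulation (serial here).

    Each face adds its flux to its left cell and subtracts it from its
    right cell. If iterations over faces ran concurrently, two faces
    sharing a cell would update the same res entry: a write conflict.
    """
    res = [0.0] * n_cells
    for f, (left, right) in enumerate(faces):
        res[left] += flux[f]
        res[right] -= flux[f]
    return res


def build_cell_to_face(n_cells, faces):
    """For every cell, list the (face index, sign) pairs that touch it."""
    cell_to_face = [[] for _ in range(n_cells)]
    for f, (left, right) in enumerate(faces):
        cell_to_face[left].append((f, 1.0))
        cell_to_face[right].append((f, -1.0))
    return cell_to_face


def gather_cell_loop(n_cells, faces, flux):
    """Cell-based accumulation.

    Each iteration of the outer loop reads shared flux values but writes
    only res[c], so the iterations are independent of one another.
    """
    cell_to_face = build_cell_to_face(n_cells, faces)
    res = [0.0] * n_cells
    for c in range(n_cells):
        acc = 0.0
        for f, sign in cell_to_face[c]:
            acc += sign * flux[f]
        res[c] = acc
    return res


# Hypothetical 1-D mesh: four cells in a row joined by three interior faces.
faces = [(0, 1), (1, 2), (2, 3)]
flux = [1.0, 2.0, 4.0]

serial_result = scatter_face_loop(4, faces, flux)
gather_result = gather_cell_loop(4, faces, flux)
```

Both functions return the same residuals for this mesh. The gather form trades the conflict for an extra adjacency structure and indirect reads of the flux array, which is one reason data locality techniques such as grid renumbering matter for this class of solver.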
