Scalable NUMA-Aware Wilson-Dirac on Supercomputers

Cited by: 2

Authors:

Tadonki, Claude [1]

Affiliations:

[1] PSL Res Univ, Mines ParisTech, CRI, 35 Rue St Honore, F-77305 Fontainebleau, France

Source:

2017 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS) | 2017

Keywords:

DOI:

10.1109/HPCS.2017.56

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

We revisit the Wilson-Dirac operator, also referred to as Dslash, on NUMA manycore vector machines and thereby seek an efficient supercomputing implementation. Quantum ChromoDynamics (QCD) is the theory of the strong nuclear force, and its discrete formalism is the so-called Lattice Quantum ChromoDynamics (LQCD). Wilson-Dirac is the major computing kernel in LQCD, where special attention is paid to large-scale simulations. The corresponding computing demand is tremendous at various levels, from storage to floating-point operations, hence the crucial need for powerful supercomputers. Designing efficient LQCD codes on modern (mostly hybrid) supercomputers requires efficiently exploiting all available levels of parallelism, including accelerators. Since Wilson-Dirac is a coarse-grain stencil computation performed on a huge volume of data, any performance- and scalability-related investigation should skillfully address memory accesses and interprocessor communication overheads. In order to lower the latter, explicit shared-memory implementations should be considered at the level of a compute node, since this leads to a less complex data communication graph and thus (at least intuitively) reduces the overall communication latency. We focus on this aspect and propose a novel efficient NUMA-aware scheduling, together with a combination of the major HPC strategies for large-scale LQCD. We reach nearly optimal performance on a single core and a significant scalability improvement on several NUMA nodes. Then, using a classical domain decomposition approach, we extend our scheduling to a large cluster of many-core nodes, thus illustrating the global efficiency of our hybrid implementation.


Pages: 315-324

Page count: 10

