Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

Cited by: 1

Authors:

Catalan, Sandra ^{[1]}

Igual, Francisco D. ^{[1]}

Herrero, Jose R. ^{[2]}

Rodriguez-Sanchez, Rafael ^{[1]}

Quintana-Orti, Enrique S. ^{[3]}

Affiliations:

[1] Univ Complutense Madrid, Dept Arquitectura Comp & Automat, Madrid, Spain

[2] Univ Politecn Cataluna, Dept Arquitectura Comp, Barcelona, Spain

[3] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia, Spain

Source:

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2023 / Vol. 175

Keywords:

NUMA architectures; Chiplets; Dense linear algebra; Shared memory programming; Portability; PERFORMANCE; SYSTEMS;

DOI:

10.1016/j.jpdc.2023.01.004

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. (c) 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND

Citation

Pages: 51-65

Page count: 15
