Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

Cited by: 1
Authors
Catalan, Sandra [1]
Igual, Francisco D. [1]
Herrero, Jose R. [2]
Rodriguez-Sanchez, Rafael [1]
Quintana-Orti, Enrique S. [3]
Affiliations
[1] Univ Complutense Madrid, Dept Arquitectura Comp & Automat, Madrid, Spain
[2] Univ Politecn Cataluna, Dept Arquitectura Comp, Barcelona, Spain
[3] Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia, Spain
Keywords
NUMA architectures; Chiplets; Dense linear algebra; Shared memory programming; Portability; PERFORMANCE; SYSTEMS;
DOI
10.1016/j.jpdc.2023.01.004
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. (c) 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND
Pages: 51-65
Number of pages: 15