Extending Shared-Memory Computations to Multiple Distributed Nodes

Authors
Ahmed, Waseem [1]
Affiliation
[1] King Abdulaziz Univ, Dept Comp Sci, Fac Comp & Informat Technol, Jeddah, Saudi Arabia
Keywords
GPU; OpenMP; shared memory programming; distributed programming; CUDA; matrix multiplication; performance; programs
DOI
10.14569/IJACSA.2020.0110882
Chinese Library Classification
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
With the emergence of accelerators like GPUs, MICs and FPGAs, the availability of domain-specific libraries (like MKL) and the ease of parallelization associated with CUDA and OpenMP based shared-memory programming, node-based parallelization has recently become a popular choice among developers in the field of scientific computing. This is evident from the large volume of recently published work in various domains of scientific computing, where shared-memory programming and accelerators have been used to accelerate applications. Although these approaches are suitable for small problem sizes, there are issues that need to be addressed for them to be applicable to larger input domains. Firstly, the primary focus of these works has been to accelerate the core kernel; acceleration of input/output operations is seldom considered. Many operations in scientific computing operate on large matrices, both sparse and dense, that are read from and written to external files. These input/output operations become bottlenecks and significantly affect the overall application time. Secondly, node-based parallelization prevents a developer from distributing the computation beyond a single node without first learning an additional programming paradigm like MPI. Thirdly, the problem size that can be effectively handled by a node is limited by the memory of the node and accelerator. In this paper, an Asynchronous Multi-node Execution (AMNE) approach is presented that uses a unique combination of the shared file system and pseudo-replication to extend node-based algorithms to a distributed multi-node implementation with minimal changes to the original node-based code. We demonstrate this approach by applying it to GEMM, a popular kernel in dense linear algebra, and show that the presented methodology significantly advances the state of the art in the field of parallelization and scientific computing.
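The abstract does not describe how AMNE divides work, so the following is only a generic sketch of the underlying idea of extending a node-level GEMM to several nodes: the rows of C = A·B are split into contiguous blocks, and each node computes its block independently with the same unchanged kernel. It is not the paper's algorithm. The names `row_block`, `gemm_rows` and `multi_node_gemm` are hypothetical, the "nodes" are simulated by a loop in one process, and the shared file system and pseudo-replication mentioned above are not modeled.

```python
def row_block(node_id, num_nodes, n_rows):
    """Return the half-open row range [start, stop) owned by node_id.

    Rows are split as evenly as possible; the first (n_rows % num_nodes)
    nodes each receive one extra row.
    """
    base, extra = divmod(n_rows, num_nodes)
    start = node_id * base + min(node_id, extra)
    stop = start + base + (1 if node_id < extra else 0)
    return start, stop


def gemm_rows(A, B, start, stop):
    """Node-level kernel: compute rows start..stop-1 of C = A * B."""
    inner, cols = len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(start, stop)
    ]


def multi_node_gemm(A, B, num_nodes):
    """Simulate num_nodes independent nodes and stitch their row blocks.

    Assumption: in a real deployment each iteration would run on a
    separate machine; here they run sequentially for illustration.
    """
    C = []
    for node_id in range(num_nodes):
        start, stop = row_block(node_id, num_nodes, len(A))
        C.extend(gemm_rows(A, B, start, stop))
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    B = [[7, 8], [9, 10]]
    print(multi_node_gemm(A, B, num_nodes=2))
```

Because each row block of C depends only on the matching rows of A and on all of B, the blocks need no communication while they are being computed, which is why a row-wise split is a common starting point for distributing GEMM.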
Pages: 675-685
Page count: 11