Extending Shared-Memory Computations to Multiple Distributed Nodes

Cited by: 0

Authors:

Ahmed, Waseem [1]

Affiliations:

[1] King Abdulaziz Univ, Dept Comp Sci, Fac Comp & Informat Technol, Jeddah, Saudi Arabia

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2020, Vol. 11, No. 8

Keywords:

GPU; OpenMP; shared memory programming; distributed programming; CUDA; MATRIX MULTIPLICATION; PERFORMANCE; PROGRAMS

DOI:

10.14569/IJACSA.2020.0110882

Chinese Library Classification number:

TP301 [Theory and Methods]

Discipline classification code:

081202

Abstract:

With the emergence of accelerators like GPUs, MICs and FPGAs, the availability of domain-specific libraries (like MKL) and the ease of parallelization associated with CUDA- and OpenMP-based shared-memory programming, node-based parallelization has recently become a popular choice among developers in the field of scientific computing. This is evident from the large volume of recently published work in various domains of scientific computing, where shared-memory programming and accelerators have been used to accelerate applications. Although these approaches are suitable for small problem sizes, there are issues that need to be addressed for them to be applicable to larger input domains. Firstly, the primary focus of these works has been to accelerate the core kernel; acceleration of input/output operations is seldom considered. Many operations in scientific computing operate on large matrices, both sparse and dense, that are read from and written to external files. These input/output operations present themselves as bottlenecks and significantly affect the overall application time. Secondly, node-based parallelization prevents a developer from distributing the computation beyond a single node without learning an additional programming paradigm like MPI. Thirdly, the problem size that can be effectively handled by a node is limited by the memory of the node and accelerator. In this paper, an Asynchronous Multi-node Execution (AMNE) approach is presented that uses a unique combination of the shared-file system and pseudo-replication to extend node-based algorithms to a distributed multiple-node implementation with minimal changes to the original node-based code. We demonstrate this approach by applying it to GEMM, a popular kernel in dense linear algebra, and show that the presented methodology significantly advances the state of the art in the field of parallelization and scientific computing.


Pages: 675-685

Page count: 11

