Optimized On-Chip-Pipelining for Memory-Intensive Computations on Multi-Core Processors with Explicit Memory Hierarchy

Cited by: 0

Authors:

Keller, Joerg [1]

Kessler, Christoph W. [2]

Hulten, Rikard [2]

Affiliations:

[1] FernUniv, Hagen, Germany

[2] Linkopings Univ, Linkoping, Sweden

Source:

JOURNAL OF UNIVERSAL COMPUTER SCIENCE | 2012 / Vol. 18 / Issue 14

Keywords:

parallel merge sort; on-chip pipelining; multicore computing; task mapping; streaming computations; ALGORITHMS;

DOI:

Not available

Chinese Library Classification:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic with an increasing number of cores. Especially for streaming computations, where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization. On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, which reduces the volume of main memory accesses and thereby improves the throughput of memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip network, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead. In this article, we consider parallel mergesort in detail as a representative memory-intensive application, and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique, and present several algorithms for optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for mapping other streaming task graphs. We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. We experimentally evaluate the influence of buffer sizes and mapping optimizations, and show that, for realistic problem sizes, optimized on-chip pipelining indeed speeds up merging by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was until now the fastest merge sort implementation on Cell.
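The idea of mapping merge-tree tasks to cores can be illustrated with a minimal sketch. This is not one of the article's mapping algorithms: it uses a generic longest-processing-time-first heuristic that balances computational load only and ignores buffer memory consumption and on-chip communication load. The function names `merge_tree_tasks` and `greedy_map` are hypothetical, and the load model (a merger at depth d handles 1/2^d of the root's stream) is an assumption made for illustration.

```python
import heapq


def merge_tree_tasks(levels):
    """Return (task_name, relative_load) pairs for a complete binary merge tree.

    Assumption: the root merger (level 0) handles the full output stream
    (load 1.0), and each merger at level d handles 1 / 2**d of it.
    """
    tasks = []
    for d in range(levels):
        for i in range(2 ** d):
            tasks.append((f"merge_L{d}_{i}", 1.0 / 2 ** d))
    return tasks


def greedy_map(tasks, num_cores):
    """Assign the heaviest remaining task to the currently least-loaded core.

    Returns a list of (load, core_id, task_names) tuples ordered by core_id.
    Only load balance is considered here; buffer sizes and network
    traffic are deliberately left out of this sketch.
    """
    heap = [(0.0, core, []) for core in range(num_cores)]
    heapq.heapify(heap)
    for name, load in sorted(tasks, key=lambda t: -t[1]):
        core_load, core, assigned = heapq.heappop(heap)
        assigned.append(name)
        heapq.heappush(heap, (core_load + load, core, assigned))
    return sorted(heap, key=lambda entry: entry[1])


if __name__ == "__main__":
    for load, core, names in greedy_map(merge_tree_tasks(3), 3):
        print(f"core {core}: load {load:.2f}, tasks {names}")
```

Under the assumed load model, a three-level tree on three cores ends with every core at load 1.0: the root merger occupies one core alone, while the other two cores each take one level-1 merger and two level-2 mergers.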

Citation

Pages: 1987-2023

Number of pages: 37
