Optimized On-Chip-Pipelining for Memory-Intensive Computations on Multi-Core Processors with Explicit Memory Hierarchy

Cited by: 0
Authors
Keller, Joerg [1]
Kessler, Christoph W. [2]
Hulten, Rikard [2]
Affiliations
[1] FernUniv, Hagen, Germany
[2] Linkopings Univ, Linkoping, Sweden
Keywords
parallel merge sort; on-chip pipelining; multicore computing; task mapping; streaming computations; ALGORITHMS;
DOI
Not available
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic as the number of cores increases. Especially for streaming computations, where the ratio of computational work to memory transfer is low, transforming the program into more memory-efficient code is an important optimization. On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network. This reduces the volume of main memory accesses and thereby improves the throughput of memory-intensive computations. At the same time, throughput is constrained by the limited amount of on-chip memory available for buffering forwarded data. Optimizing the mapping of tasks to cores, which involves a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip network, allows a larger buffer size to be used, resulting in less DMA communication and scheduling overhead. In this article, we consider parallel mergesort in detail as a representative memory-intensive application and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique and present several algorithms for optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for mapping other streaming task graphs. We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. We experimentally evaluate the influence of buffer sizes and mapping optimizations, and show that, for realistic problem sizes, optimized on-chip pipelining indeed speeds up merging times by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was until now the fastest merge sort implementation on Cell.
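The forwarding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' Cell implementation: it runs on a single thread, uses Python generators to stand in for merge tasks, and ignores task-to-core mapping and DMA entirely. The names `chunked`, `merge_node`, `pipelined_merge`, and the parameter `buffer_size` are hypothetical and chosen only for illustration. The point it shows is that each inner node of the merge tree forwards its partial output in bounded chunks to its parent, so no intermediate merged run is ever stored in full.

```python
from collections import deque
from typing import Iterator, List


def chunked(run: List[int], buffer_size: int) -> Iterator[List[int]]:
    """Leaf task (assumed): stream one sorted run in chunks of at most buffer_size."""
    for i in range(0, len(run), buffer_size):
        yield run[i:i + buffer_size]


def merge_node(left: Iterator[List[int]], right: Iterator[List[int]],
               buffer_size: int) -> Iterator[List[int]]:
    """Inner task (assumed): 2-way merge of two chunk streams.

    Output is forwarded as soon as one chunk of buffer_size items is full,
    which models passing partial results on instead of writing them back.
    """
    lbuf, rbuf = deque(), deque()
    ldone = rdone = False
    out: List[int] = []
    while True:
        # Refill an empty input buffer from its producer, if it is not exhausted.
        if not lbuf and not ldone:
            chunk = next(left, None)
            if chunk is None:
                ldone = True
            else:
                lbuf.extend(chunk)
        if not rbuf and not rdone:
            chunk = next(right, None)
            if chunk is None:
                rdone = True
            else:
                rbuf.extend(chunk)
        # Pick the source of the next output item.
        if lbuf and rbuf:
            src = lbuf if lbuf[0] <= rbuf[0] else rbuf
        elif lbuf and rdone:
            src = lbuf
        elif rbuf and ldone:
            src = rbuf
        elif ldone and rdone:
            break
        else:
            continue  # an input delivered an empty chunk; try refilling again
        out.append(src.popleft())
        if len(out) == buffer_size:
            yield out  # forward a full buffer to the parent node
            out = []
    if out:
        yield out  # forward the final, possibly partial, buffer


def pipelined_merge(runs: List[List[int]], buffer_size: int) -> List[int]:
    """Build a binary merge tree over the sorted runs and drain its root."""
    if not runs:
        return []
    streams = [chunked(r, buffer_size) for r in runs]
    while len(streams) > 1:
        nxt = [merge_node(streams[i], streams[i + 1], buffer_size)
               for i in range(0, len(streams) - 1, 2)]
        if len(streams) % 2:
            nxt.append(streams[-1])  # odd stream passes through to the next level
        streams = nxt
    return [x for chunk in streams[0] for x in chunk]
```

In this sketch, a larger `buffer_size` means fewer, larger hand-offs between nodes, which loosely mirrors the buffer-size trade-off mentioned in the abstract; the actual costs on real hardware depend on the mapping and memory constraints that the sketch leaves out.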
Pages: 1987-2023
Page count: 37
Related papers
34 in total
  • [21] Hybrid Memory Architecture for Voltage Scaling in Ultra-Low Power Multi-Core Biomedical Processors
    Bortolotti, Daniele
    Bartolini, Andrea
    Weis, Christian
    Rossi, Davide
    Benini, Luca
    2014 DESIGN, AUTOMATION AND TEST IN EUROPE CONFERENCE AND EXHIBITION (DATE), 2014,
  • [22] Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-Core Processors
    Rai, Siddharth
    Chaudhuri, Mainak
    ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, 2017, 16
  • [23] POSTER: Fault-tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support
    Haas, Florian
    Weis, Sebastian
    Ungerer, Theo
    Pokam, Gilles
    Wu, Youfeng
    2016 INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURE AND COMPILATION TECHNIQUES (PACT), 2016, : 421 - 422
  • [24] Off-Chip Memory Bandwidth Minimization through Cache Partitioning for Multi-Core Platforms
    Yu, Chenjie
    Petrov, Peter
    PROCEEDINGS OF THE 47TH DESIGN AUTOMATION CONFERENCE, 2010, : 132 - 137
  • [25] Adaptive and Speculative Memory Consistency Support for Multi-core Architectures with On-Chip Local Memories
    Vujic, Nikola
    Alvarez, Lluc
    Gonzalez Tallada, Marc
    Martorell, Xavier
    Ayguade, Eduard
    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2010, 5898 : 218 - +
  • [26] The Design and Implementation of a Heterogeneous Multi-core Security Chip architecture Based on Shared Memory System
    Zhang, Lei
    Dong, Renping
    Zhang, Chang
    Yu, Yaping
    MECHANICAL COMPONENTS AND CONTROL ENGINEERING III, 2014, 668-669 : 1314 - 1318
  • [27] Long Short-Term Memory Neural Network-based Power Forecasting of Multi-Core Processors
    Sagi, Mark
    Rapp, Martin
    Khdr, Heba
    Zhang, Yizhe
    Fasfous, Nael
    Nguyen Anh Vu Doan
    Wild, Thomas
    Henkel, Joerg
    Herkersdorf, Andreas
    PROCEEDINGS OF THE 2021 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2021), 2021, : 1685 - 1690
  • [28] Realizing Out-of-Core Stencil Computations using Multi-Tier Memory Hierarchy on GPGPU Clusters
    Endo, Toshio
    2016 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2016, : 21 - 29
  • [29] Realization of SMS4 Algorithm Based on Share Memory of the Heterogeneous Multi-Core Password Chip System
    Zhang, Lei
    Dong, Renping
    Yu, Yaping
    MECHANICAL COMPONENTS AND CONTROL ENGINEERING III, 2014, 668-669 : 1368 - 1373
  • [30] A New Parallel Symmetric Tridiagonal Eigensolver Based on Bisection and Inverse Iteration Algorithms for Shared-memory Multi-core Processors
    Ishigami, Hiroyuki
    Kimura, Kinji
    Nakamura, Yoshimasa
    2015 10TH INTERNATIONAL CONFERENCE ON P2P, PARALLEL, GRID, CLOUD AND INTERNET COMPUTING (3PGCIC), 2015, : 216 - 223