Optimized On-Chip-Pipelining for Memory-Intensive Computations on Multi-Core Processors with Explicit Memory Hierarchy

Cited by: 0
Authors
Keller, Joerg [1]
Kessler, Christoph W. [2]
Hulten, Rikard [2]
Affiliations
[1] FernUniv, Hagen, Germany
[2] Linkopings Univ, Linkoping, Sweden
Keywords
parallel merge sort; on-chip pipelining; multicore computing; task mapping; streaming computations; ALGORITHMS;
DOI
Not available
Chinese Library Classification
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic as the number of cores increases. Especially for streaming computations, where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization. On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network. This reduces the volume of main memory accesses and thereby improves the throughput of memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, trading off load balance, buffer memory consumption, and communication load on the on-chip network, a larger buffer size can be used, resulting in less DMA communication and scheduling overhead. In this article, we consider parallel mergesort in detail as a representative memory-intensive application, and focus on the global merging phase, which dominates the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique, and present several algorithms for optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for mapping other streaming task graphs. We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. We experimentally evaluate the influence of buffer sizes and mapping optimizations, and show that, for realistic problem sizes, optimized on-chip pipelining speeds up merging times by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was until now the fastest merge sort implementation on Cell.
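The idea of forwarding partial results through a merge tree can be illustrated with a minimal Python sketch. It is not the authors' Cell implementation: generators stand in for merger tasks on cores, Python lists stand in for on-chip buffers, and the names `leaf`, `merge2`, `merge_tree`, `pipelined_merge`, and the parameter `buf` are hypothetical. The sketch only shows that each 2-way merger hands bounded chunks straight to its parent, so intermediate merged runs are never stored in full.

```python
from typing import Iterator, List


def leaf(run: List[int], buf: int) -> Iterator[List[int]]:
    """Stream one sorted run in chunks of at most `buf` items (stand-in for loads from main memory)."""
    for i in range(0, len(run), buf):
        yield run[i:i + buf]


def _items(chunks: Iterator[List[int]]) -> Iterator[int]:
    """Flatten a chunk stream into single items."""
    for chunk in chunks:
        yield from chunk


def merge2(left: Iterator[List[int]], right: Iterator[List[int]], buf: int) -> Iterator[List[int]]:
    """2-way merger task: consume two chunk streams and forward merged chunks of at most `buf` items."""
    a, b = _items(left), _items(right)
    x, y = next(a, None), next(b, None)
    out: List[int] = []
    while x is not None or y is not None:
        if y is None or (x is not None and x <= y):
            out.append(x)
            x = next(a, None)
        else:
            out.append(y)
            y = next(b, None)
        if len(out) == buf:  # output buffer full: forward it to the parent task
            yield out
            out = []
    if out:
        yield out


def merge_tree(runs: List[List[int]], buf: int) -> Iterator[List[int]]:
    """Build a binary tree of merger tasks over the sorted runs and return the root's chunk stream."""
    streams = [leaf(r, buf) for r in runs]
    if not streams:
        return iter(())
    while len(streams) > 1:
        nxt = [merge2(streams[i], streams[i + 1], buf)
               for i in range(0, len(streams) - 1, 2)]
        if len(streams) % 2:  # odd stream passes through to the next level
            nxt.append(streams[-1])
        streams = nxt
    return streams[0]


def pipelined_merge(runs: List[List[int]], buf: int = 4) -> List[int]:
    """Drain the root of the merge tree; only this final result is materialized in full."""
    return [v for chunk in merge_tree(runs, buf) for v in chunk]
```

For example, `pipelined_merge([[1, 4, 9], [2, 3, 10], [0, 5], [6, 7, 8]], buf=2)` merges four sorted runs through a two-level tree. In this sketch `buf` only bounds the chunk size; the trade-off the abstract describes, where a larger buffer means fewer DMA transfers but more on-chip memory per task, is not modeled.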
Pages: 1987-2023
Number of pages: 37