Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods

Cited by: 3
Authors
Asri, Mochamad [1 ]
Malhotra, Dhairya [2 ]
Wang, Jiajun [1 ]
Biros, George [3 ]
John, Lizy K. [1 ]
Gerstlauer, Andreas [1 ]
Affiliations
[1] Univ Texas Austin, Elect & Comp Engn Dept, Austin, TX 78712 USA
[2] Flatiron Inst, New York, NY 10010 USA
[3] Univ Texas Austin, Inst Computat Engn & Sci, Austin, TX 78712 USA
Keywords
System-on-chip; Acceleration; Random access memory; Optimization; Couplings; Computer architecture; Software
DOI
10.1109/TPDS.2021.3056045
CLC Number
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
In this article, we study the performance and energy-saving benefits of hardware acceleration under different hardware configurations and usage scenarios for a state-of-the-art Fast Multipole Method (FMM), a popular N-body method. We use a dedicated Application-Specific Integrated Circuit (ASIC) to accelerate General Matrix-Matrix Multiply (GEMM) operations. FMM is widely used and is a representative example of the workloads found in many HPC applications. We compare architectures that integrate the GEMM ASIC next to, in, or near main memory against an on-chip coupling aimed at minimizing or avoiding repeated round-trip transfers through DRAM for communication between the accelerator and the CPU. We study these tradeoffs using detailed and accurately calibrated x86 CPU, accelerator, and DRAM simulations. Our results show that simply integrating accelerators more closely with the CPU does not necessarily lead to performance/energy gains. We demonstrate that, while careful software blocking and on-chip placement optimizations can reduce DRAM accesses by 2X over a naive on-chip integration, these dramatic savings in DRAM traffic do not automatically translate into significant total energy or runtime savings. This is chiefly due to the application characteristics, the high idle power, and the effective hiding of memory latencies in modern systems. Only when more aggressive co-optimizations such as software pipelining and overlapping are applied can additional performance and energy savings of 37 and 35 percent, respectively, be unlocked over baseline acceleration. When similar optimizations (pipelining and overlapping) are applied with an off-chip integration, the on-chip integration delivers up to 20 percent better performance and 17 percent lower total energy consumption than the off-chip integration.
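As an illustration of the software blocking idea described in the abstract, the following is a minimal sketch in plain C (not taken from the paper; the tile size BLK and the function names are assumptions for illustration) of a tiled GEMM loop nest. Processing C += A*B one BLK x BLK tile at a time keeps the working set in on-chip storage, so each element of A and B is reloaded from DRAM far less often than in the naive triple loop; in an accelerator setting, such a tile is also the natural unit of work to offload and to double-buffer so that data movement overlaps computation.

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

#define BLK 64  /* illustrative tile size; in practice tuned to the on-chip buffer capacity */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C (n x n, row-major) += A (n x n) * B (n x n), processed in BLK x BLK tiles. */
void gemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t kk = 0; kk < n; kk += BLK)
            for (size_t jj = 0; jj < n; jj += BLK)
                /* One tile update: the unit of work an accelerator would execute;
                 * with double buffering, the next tile's operands can be fetched
                 * while the current tile is being computed. */
                for (size_t i = ii; i < min_sz(ii + BLK, n); ++i)
                    for (size_t k = kk; k < min_sz(kk + BLK, n); ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + BLK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main(void)
{
    /* Tiny smoke test: multiply two 128 x 128 matrices of ones;
     * every entry of the product should equal 128. */
    const size_t n = 128;
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = calloc(n * n, sizeof *C);
    for (size_t i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 1.0; }
    gemm_blocked(n, A, B, C);
    printf("C[0][0] = %.1f (expected %.1f)\n", C[0], (double)n);
    free(A); free(B); free(C);
    return 0;
}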
Pages: 2035-2048
Number of pages: 14