OuterSPACE: An Outer Product based Sparse Matrix Multiplication Accelerator

Cited by: 150
Authors
Pal, Subhankar [1]
Beaumont, Jonathan [1]
Park, Dong-Hyeon [1]
Amarnath, Aporva [1]
Feng, Siying [1]
Chakrabarti, Chaitali [2]
Kim, Hun-Seok [1]
Blaauw, David [1]
Mudge, Trevor [1]
Dreslinski, Ronald [1]
Affiliations
[1] Univ Michigan, Ann Arbor, MI 48109 USA
[2] Arizona State Univ, Tempe, AZ USA
Keywords
Sparse matrix processing; application-specific hardware; parallel computer architecture; hardware-software codesign; hardware accelerators; ALGEBRA; PERFORMANCE;
DOI
10.1109/HPCA.2018.00067
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9x over Intel Math Kernel Library on a Xeon CPU, 13.0x against cuSPARSE and 14.0x against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².
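The outer product formulation named in the abstract can be illustrated in software. The following is a minimal Python sketch, not the accelerator's implementation: the function name `outer_product_spgemm` and the dict-of-lists storage layout are assumptions made for illustration. It shows the two decoupled phases: a multiply phase, in which column k of A is paired with row k of B so that each non-zero is read once, and a merge phase that accumulates the resulting partial products per output row.

```python
from collections import defaultdict


def outer_product_spgemm(a_cols, b_rows):
    """Compute C = A * B for sparse A and B via outer products.

    a_cols[k] is a list of (row, value) pairs for column k of A.
    b_rows[k] is a list of (col, value) pairs for row k of B.
    Returns C as {row: {col: value}} holding only the entries produced.
    (Layout and names are illustrative assumptions.)
    """
    # Multiply phase: column k of A times row k of B gives one
    # partial-product matrix. No accumulation happens here.
    partial = defaultdict(list)  # output row -> list of (col, product)
    for k, a_col in a_cols.items():
        b_row = b_rows.get(k, [])
        for i, a_val in a_col:
            for j, b_val in b_row:
                partial[i].append((j, a_val * b_val))

    # Merge phase: sum the partial products that share an output position.
    c = {}
    for i, products in partial.items():
        row = defaultdict(float)
        for j, value in products:
            row[j] += value
        c[i] = dict(row)
    return c


# A = [[1, 0], [2, 3]] stored by column, B = [[0, 4], [5, 0]] stored by row.
a_cols = {0: [(0, 1), (1, 2)], 1: [(1, 3)]}
b_rows = {0: [(1, 4)], 1: [(0, 5)]}
c = outer_product_spgemm(a_cols, b_rows)
```

Here `c` holds the non-zeros of [[0, 4], [15, 8]]. Keeping the merge as a separate pass is what lets the multiply phase stream through each input non-zero once, at the cost of storing the intermediate partial products.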
Pages: 724-736
Page count: 13
Related papers
50 in total
  • [1] MOSCON: Modified Outer Product based Sparse Matrix-Matrix Multiplication Accelerator with Configurable Tiles
    Noble, G.
    Nalesh, S.
    Kala, S.
    2023 36TH INTERNATIONAL CONFERENCE ON VLSI DESIGN AND 2023 22ND INTERNATIONAL CONFERENCE ON EMBEDDED SYSTEMS, VLSID, 2023, : 264 - 269
  • [2] MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product
    Srivastava, Nitish
    Jin, Hanchen
    Liu, Jie
    Albonesi, David
    Zhang, Zhiru
    2020 53RD ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO 2020), 2020, : 766 - 780
  • [3] Towards Memristor based Accelerator for Sparse Matrix Vector Multiplication
    Cui, Jianwei
    Qiu, Qinru
    2016 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2016, : 121 - 124
  • [4] Sparse-Sparse Matrix Multiplication Accelerator on FPGA featuring Distribute-Merge Product Dataflow
    Nagahara, Yuta
    Yan, Jiale
    Kawamura, Kazushi
    Motomura, Masato
    Chu, Thiem Van
    29TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, ASP-DAC 2024, 2024, : 785 - 791
  • [5] Row-Wise Product-Based Sparse Matrix Multiplication Hardware Accelerator With Optimal Load Balancing
    Lee, Jong Hun
    Park, Beomjin
    Kong, Joonho
    Munir, Arslan
    IEEE ACCESS, 2022, 10 : 64547 - 64559
  • [6] SIMULTANEOUS INPUT AND OUTPUT MATRIX PARTITIONING FOR OUTER-PRODUCT-PARALLEL SPARSE MATRIX-MATRIX MULTIPLICATION
    Akbudak, Kadir
    Aykanat, Cevdet
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2014, 36 (05): : C568 - C590
  • [7] A sparse matrix vector multiplication accelerator based on high-bandwidth memory
    Li, Tao
    Shen, Li
    COMPUTERS & ELECTRICAL ENGINEERING, 2023, 105
  • [8] Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product
    Lu, Xiaobo
    Fang, Jianbin
    Peng, Lin
    Huang, Chun
    Du, Zidong
    Zhao, Yongwei
    Wang, Zheng
    ACM Transactions on Architecture and Code Optimization, 2024, 21 (04)
  • [9] An Efficient Gustavson-Based Sparse Matrix-Matrix Multiplication Accelerator on Embedded FPGAs
    Li, Shiqing
    Huai, Shuo
    Liu, Weichen
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2023, 42 (12) : 4671 - 4680
  • [10] InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-aware Inner Product Processing
    Baek, Daehyeon
    Hwang, Soojin
    Heo, Taekyung
    Kim, Daehoon
    Huh, Jaehyuk
    30TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT 2021), 2021, : 116 - 128