Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product

Cited by: 0
Authors
Lu, Xiaobo [1 ]
Fang, Jianbin [1 ]
Peng, Lin [1 ]
Huang, Chun [1 ]
Du, Zidong [2 ]
Zhao, Yongwei [2 ]
Wang, Zheng [3 ]
Affiliations
[1] School of Computer Science and Technology, National University of Defense Technology, Changsha, China
[2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
[3] Northwest University, Xi'an, China
Keywords
Hardware-software codesign; Matrix algebra; Digital storage; Semiconductor storage
DOI
10.1145/3688612
Abstract
Sparse-dense matrix multiplication (SpMM) is the performance bottleneck of many high-performance computing and deep-learning applications, making it attractive to design specialized SpMM hardware accelerators. Unfortunately, existing hardware solutions either fail to take full advantage of data-reuse opportunities in the input and output matrices or suffer from irregular memory access patterns. Their strategies increase off-chip memory traffic and bandwidth pressure, leaving much room for improvement. We present Mentor, a new approach to designing SpMM accelerators. Our key insight is that column-wise dataflow, though rarely exploited in prior work, can address these issues in SpMM computation. Mentor is a software-hardware co-design that leverages column-wise dataflow to improve data reuse and the regularity of memory accesses in SpMM. On the software level, Mentor incorporates a novel streaming construction scheme that preprocesses the input matrix to enable a streaming access pattern. On the hardware level, it employs a fully pipelined design to further unlock the potential of column-wise dataflow. The design of Mentor is underpinned by a carefully constructed analytical model that navigates the tradeoff between performance and hardware resources. We have implemented an FPGA prototype of Mentor. Experimental results show that, compared with state-of-the-art hardware solutions, Mentor achieves a geometric-mean speedup of 2.05× (up to 3.98×), reduces memory traffic by a geometric mean of 2.92× (up to 4.93×), and improves bandwidth utilization by a geometric mean of 1.38× (up to 2.89×). © 2024 Copyright held by the owner/author(s).
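To make the column-wise dataflow concrete, the sketch below expresses its access pattern in plain Python, assuming the sparse matrix A is stored in CSC (compressed sparse column) format. The function name spmm_column_wise and the data layouts are our own illustrative assumptions, not the paper's implementation; Mentor's streaming construction scheme and pipelined hardware, which build on this dataflow, are deliberately not modeled.

# Hypothetical sketch of column-wise product SpMM: C = A @ B, where
# A (m x k) is sparse in CSC format and B (k x n) is dense, row-major.
def spmm_column_wise(m, k, n, col_ptr, row_idx, values, B):
    C = [[0.0] * n for _ in range(m)]
    for j in range(n):                      # produce C one column at a time
        for col in range(k):                # stream the columns of A in order
            b = B[col][j]                   # one dense operand per A-column
            if b == 0.0:
                continue                    # nothing to accumulate
            for p in range(col_ptr[col], col_ptr[col + 1]):
                C[row_idx[p]][j] += values[p] * b   # scale-and-accumulate
    return C

# Tiny usage example: A = [[1, 0], [0, 2], [3, 0]] in CSC form.
col_ptr = [0, 2, 3]           # column start offsets into row_idx/values
row_idx = [0, 2, 1]           # row indices of the nonzeros
values  = [1.0, 3.0, 2.0]     # nonzero values
B = [[4.0, 5.0],
     [6.0, 7.0]]
print(spmm_column_wise(3, 2, 2, col_ptr, row_idx, values, B))
# -> [[4.0, 5.0], [12.0, 14.0], [12.0, 15.0]]

Note how A's CSC arrays and each column of B are read in strictly increasing order, illustrating the regular, streaming access pattern the abstract attributes to column-wise dataflow; the remaining irregularity is confined to the accumulations into C, which is what a hardware design along these lines must handle on chip.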