Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product

Cited by: 0
Authors
Lu, Xiaobo [1 ]
Fang, Jianbin [1 ]
Peng, Lin [1 ]
Huang, Chun [1 ]
Du, Zidong [2 ]
Zhao, Yongwei [2 ]
Wang, Zheng [3 ]
Affiliations
[1] School of Computer Science and Technology, National University of Defense Technology, Changsha, China
[2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
[3] Northwest University, Xi'an, China
Keywords
Hardware-software codesign; Matrix algebra; Digital storage; Semiconductor storage
DOI
10.1145/3688612
Abstract
Sparse-dense matrix multiplication (SpMM) is the performance bottleneck of many high-performance computing and deep-learning applications, making it attractive to design specialized SpMM hardware accelerators. Unfortunately, existing hardware solutions either fail to take full advantage of data-reuse opportunities in the input and output matrices or suffer from irregular memory access patterns. Their strategies increase off-chip memory traffic and bandwidth pressure, leaving much room for improvement. We present Mentor, a new approach to designing SpMM accelerators. Our key insight is that column-wise dataflow, though rarely exploited in prior work, can address these issues in SpMM computation. Mentor is a software-hardware co-design that leverages column-wise dataflow to improve data reuse and the regularity of memory accesses in SpMM. On the software level, Mentor incorporates a novel streaming construction scheme that preprocesses the input matrix to enable a streaming access pattern. On the hardware level, it employs a fully pipelined design to further unlock the potential of column-wise dataflow. The design of Mentor is underpinned by a carefully constructed analytical model that navigates the tradeoff between performance and hardware resources. We have implemented an FPGA prototype of Mentor. Experimental results show that, compared with state-of-the-art hardware solutions, Mentor achieves a geometric-mean speedup of 2.05× (up to 3.98×), reduces memory traffic by a geometric mean of 2.92× (up to 4.93×), and improves bandwidth utilization by a geometric mean of 1.38× (up to 2.89×). © 2024 Copyright held by the owner/author(s).
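To make the column-wise dataflow concrete, the sketch below expresses its access pattern in plain Python, assuming the sparse matrix A is stored in CSC (compressed sparse column) format. The function name spmm_column_wise and the data layouts are our own illustrative assumptions, not the paper's implementation; Mentor's streaming construction scheme and pipelined hardware, which build on this dataflow, are deliberately not modeled.

# Hypothetical sketch of column-wise product SpMM: C = A @ B, where
# A (m x k) is sparse in CSC format and B (k x n) is dense, row-major.
def spmm_column_wise(m, k, n, col_ptr, row_idx, values, B):
    C = [[0.0] * n for _ in range(m)]
    for j in range(n):                      # produce C one column at a time
        for col in range(k):                # stream the columns of A in order
            b = B[col][j]                   # one dense operand per A-column
            if b == 0.0:
                continue                    # nothing to accumulate
            for p in range(col_ptr[col], col_ptr[col + 1]):
                C[row_idx[p]][j] += values[p] * b   # scale-and-accumulate
    return C

# Tiny usage example: A = [[1, 0], [0, 2], [3, 0]] in CSC form.
col_ptr = [0, 2, 3]           # column start offsets into row_idx/values
row_idx = [0, 2, 1]           # row indices of the nonzeros
values  = [1.0, 3.0, 2.0]     # nonzero values
B = [[4.0, 5.0],
     [6.0, 7.0]]
print(spmm_column_wise(3, 2, 2, col_ptr, row_idx, values, B))
# -> [[4.0, 5.0], [12.0, 14.0], [12.0, 15.0]]

Note how A's CSC arrays and each column of B are read in strictly increasing order, illustrating the regular, streaming access pattern the abstract attributes to column-wise dataflow; the remaining irregularity is confined to the accumulations into C, which is what a hardware design along these lines must handle on chip.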