An FPGA-Based Transformer Accelerator With Parallel Unstructured Sparsity Handling for Question-Answering Applications
Cited by: 0
Authors:
Cao, Rujian [1,2]
Zhao, Zhongyu [1,2]
Un, Ka-Fai [1,2]
Yu, Wei-Han [1,2]
Martins, Rui P. [1,2,3]
Mak, Pui-In [1,2]
Affiliations:
[1] Univ Macau, Inst Microelect, State Key Lab Analog & Mixed Signal VLSI, Macau, Peoples R China
[2] Univ Macau, Fac Sci & Technol, ECE, Macau, Peoples R China
[3] Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal
Dataflow management provides limited performance improvement for transformer models because they exhibit less weight reuse than convolutional neural networks. The cosFormer reduces computational complexity while achieving performance comparable to the vanilla transformer on natural language processing tasks. However, the unstructured sparsity in the cosFormer makes efficient implementation challenging. This brief proposes a parallel unstructured sparsity handling (PUSH) scheme to compute sparse-dense matrix multiplication (SDMM) efficiently. It transforms unstructured sparsity into structured sparsity and reduces total memory access by balancing the memory accesses of the sparse and dense matrices in the SDMM. We also employ unstructured weight pruning in cooperation with PUSH to further increase the structured sparsity of the model. Verified on an FPGA platform, the proposed accelerator achieves a throughput of 2.82 TOPS and an energy efficiency of 144.8 GOPS/W on the HotpotQA dataset with long sequences.
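The PUSH scheme itself is a hardware dataflow, but the sparse-dense matrix multiplication it accelerates can be illustrated with a minimal software sketch. The CSR-style layout, function names, and the 70% sparsity level below are illustrative assumptions for exposition, not the paper's actual storage format or pruning ratio; the point is only that SDMM skips pruned weights, so memory traffic and compute scale with the number of nonzeros rather than the dense matrix size.

```python
import numpy as np

def dense_to_csr(W):
    """Pack a pruned (sparse) weight matrix into CSR-style arrays.
    Illustrative only -- not the PUSH hardware format."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        for j, w in enumerate(row):
            if w != 0.0:            # store only surviving (unpruned) weights
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values)) # row i's nonzeros: row_ptr[i]..row_ptr[i+1]
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def sdmm(values, col_idx, row_ptr, X):
    """Sparse(W) x Dense(X): each output row accumulates contributions
    from its stored nonzeros only, skipping pruned weights entirely."""
    n_rows = len(row_ptr) - 1
    Y = np.zeros((n_rows, X.shape[1]))
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            Y[i] += values[k] * X[col_idx[k]]
    return Y

# Example: a ~70%-sparse weight matrix times a dense activation matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * (rng.random((8, 8)) > 0.7)
X = rng.standard_normal((8, 4))
assert np.allclose(sdmm(*dense_to_csr(W), X), W @ X)
```

In hardware, the irregular per-row nonzero counts in `row_ptr` are exactly the unstructured-sparsity problem the paper addresses: parallel lanes assigned one row each would idle unevenly, which is why PUSH rebalances the work into a structured form.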