Watt: A Write-Optimized RRAM-Based Accelerator for Attention

Cited by: 0
Authors
Zhang, Xuan [1 ]
Song, Zhuoran [1 ]
Li, Xing [1 ]
He, Zhezhi [1 ]
Jing, Naifeng [1 ]
Jiang, Li [1 ]
Liang, Xiaoyao [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Resistive random access memory; accelerator; attention; importance; similarity; workload-aware dynamic scheduler;
DOI
10.1007/978-3-031-69766-1_8
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Attention-based models, such as Transformer and BERT, have achieved remarkable success across various tasks. However, their deployment is hindered by high memory requirements, long inference latency, and significant power consumption. One promising way to accelerate attention is resistive random access memory (RRAM), which offers processing-in-memory (PIM) capability. However, existing RRAM-based accelerators often grapple with costly write operations. Accordingly, we propose Watt, a write-optimized RRAM-based accelerator for attention-based models that reduces the amount of intermediate data written to the RRAM-based crossbars and effectively mitigates the resulting workload imbalance among crossbars. Specifically, exploiting the importance and similarity of tokens in a sequence, we design an importance detector and a similarity detector to significantly compress the intermediate data K^T and V written to the crossbars. Moreover, because numerous vectors in K^T and V are pruned, the number of vectors written to the crossbars varies across inferences, leading to workload imbalance among crossbars. To tackle this issue, we propose a workload-aware dynamic scheduler comprising a top-k engine and a remapping engine. The scheduler first ranks the accumulated write count of each crossbar and the write count of each inference using the top-k engine, and then assigns inference tasks to the crossbars via the remapping engine. Experimental results show that Watt achieves average speedups of 6.5x, 4.0x, and 2.1x over the state-of-the-art accelerators Sanger, TransPIM, and ReTransformer, respectively, and average energy savings of 18.2x, 3.2x, and 2.8x relative to the same three accelerators.
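The abstract describes the workload-aware dynamic scheduler only at a high level, so the following is a minimal Python sketch of one plausible interpretation: inferences are ranked by their post-pruning write counts and greedily remapped to the crossbar with the smallest accumulated write count. The function name `schedule_inferences`, the greedy longest-first policy, and the data layout are illustrative assumptions, not Watt's actual top-k engine or remapping hardware.

```python
import heapq

def schedule_inferences(task_write_counts, num_crossbars):
    """Greedy workload-balancing sketch (an assumption, not the paper's design):
    assign each inference's write workload to the crossbar with the smallest
    accumulated write count so far.

    task_write_counts : list[int]  # writes each inference will issue
                                   # (varies because K^T / V vectors are pruned)
    num_crossbars     : int        # number of RRAM crossbars available
    Returns a mapping: crossbar index -> list of assigned inference indices.
    """
    # Rank inferences by descending write count (a software stand-in for the
    # ranking the top-k engine would provide in hardware).
    ranked_tasks = sorted(range(len(task_write_counts)),
                          key=lambda t: task_write_counts[t], reverse=True)

    # Min-heap of (accumulated writes, crossbar id) keeps the least-written
    # crossbar at the top.
    heap = [(0, xb) for xb in range(num_crossbars)]
    heapq.heapify(heap)

    assignment = {xb: [] for xb in range(num_crossbars)}
    for t in ranked_tasks:
        load, xb = heapq.heappop(heap)                       # least-loaded crossbar
        assignment[xb].append(t)                             # "remapping": place task here
        heapq.heappush(heap, (load + task_write_counts[t], xb))
    return assignment

# Example: 6 inferences with uneven write counts after pruning, 3 crossbars.
print(schedule_inferences([120, 80, 75, 40, 30, 10], 3))
```

The min-heap simply mirrors in software the ranking that the paper attributes to the top-k engine; the greedy longest-first assignment is one standard heuristic for keeping per-crossbar write totals close to each other.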
Pages: 107-120
Page count: 14