Watt: A Write-Optimized RRAM-Based Accelerator for Attention

Cited: 0
Authors
Zhang, Xuan [1 ]
Song, Zhuoran [1 ]
Li, Xing [1 ]
He, Zhezhi [1 ]
Jing, Naifeng [1 ]
Jiang, Li [1 ]
Liang, Xiaoyao [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Resistive random access memory; accelerator; attention; importance; similarity; workload-aware dynamic scheduler;
DOI
10.1007/978-3-031-69766-1_8
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Attention-based models, such as Transformer and BERT, have achieved remarkable success across various tasks. However, their deployment is hindered by high memory requirements, long inference latency, and significant power consumption. One promising way to accelerate attention is resistive random access memory (RRAM), which offers processing-in-memory (PIM) capability. However, existing RRAM-based accelerators often grapple with costly write operations. Accordingly, we propose Watt, a write-optimized RRAM-based accelerator for attention-based models, which reduces the amount of intermediate data written to the RRAM-based crossbars and effectively mitigates workload imbalance among crossbars. Specifically, exploiting the importance and similarity of tokens in a sequence, we design an importance detector and a similarity detector that significantly compress the intermediate data K^T and V written to the crossbars. Moreover, because numerous vectors in K^T and V are pruned, the number of vectors written to the crossbars varies across inferences, leading to workload imbalance among crossbars. To tackle this issue, we propose a workload-aware dynamic scheduler comprising a top-k engine and a remapping engine. The scheduler first ranks the total write count of each crossbar and the write count of each inference using the top-k engine, then assigns inference tasks to the crossbars via the remapping engine. Experimental results show that Watt achieves average speedups of 6.5x, 4.0x, and 2.1x over the state-of-the-art accelerators Sanger, TransPIM, and ReTransformer, respectively, along with average energy savings of 18.2x, 3.2x, and 2.8x with respect to the three accelerators.
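One plausible reading of the workload-aware dynamic scheduler described above is a greedy least-loaded assignment: rank inferences by write count (the top-k engine's role), then map each one onto the crossbar with the smallest accumulated write count (the remapping engine's role). The Python sketch below is an illustrative software analogue only, not the paper's hardware design; the names `schedule` and `inference_writes` and the greedy heuristic are assumptions made for this example.

```python
# Minimal sketch of a workload-aware scheduler in the spirit of Watt's
# top-k + remapping engines (illustrative, not the paper's implementation).
import heapq

def schedule(inference_writes, num_crossbars):
    """Assign each inference to a crossbar so that total write counts
    stay balanced. inference_writes[i] is the (hypothetical) number of
    pruned K^T/V vectors inference i must write to its crossbar."""
    # "Top-k engine": rank inferences by write count, heaviest first.
    ranked = sorted(enumerate(inference_writes),
                    key=lambda iw: iw[1], reverse=True)
    # Min-heap of (accumulated write count, crossbar id) tracks the
    # per-crossbar totals, i.e. the ranking of crossbar write counts.
    crossbars = [(0, c) for c in range(num_crossbars)]
    heapq.heapify(crossbars)
    assignment = {}
    for inf_id, writes in ranked:
        load, c = heapq.heappop(crossbars)   # least-loaded crossbar
        assignment[inf_id] = c               # "remapping engine" step
        heapq.heappush(crossbars, (load + writes, c))
    return assignment

# Example: six inferences with uneven write counts mapped onto three
# crossbars; the resulting per-crossbar totals are roughly balanced.
print(schedule([120, 30, 95, 60, 10, 80], 3))
```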
Pages: 107-120
Page count: 14