Accelerating parallel reduction and scan primitives on ReRAM-based architectures

Authors
Jin Z. [1]
Duan Y. [1]
Yi E. [1]
Ji H. [1]
Liu W. [1]
Affiliations
[1] College of Information Science and Engineering, China University of Petroleum, Beijing
Keywords
parallel computing; processing in memory; reduction; ReRAM; scan
DOI
10.11887/j.cn.202205009
Abstract
Reduction and scan are two critical primitives in parallel computing, so accelerating them is of great importance. However, the von Neumann architecture suffers from performance and energy bottlenecks, known as the "memory wall", caused by unavoidable data movement. Recently, NVM (nonvolatile memory) such as ReRAM (resistive random access memory) has enabled in-situ computing without data movement, and its crossbar architecture can naturally perform a parallel GEMV (matrix-vector multiplication) operation in one step. ReRAM-based architectures have demonstrated great success in many areas, e.g. accelerating machine learning and graph computing applications. Parallel acceleration methods are proposed for the reduction and scan primitives on a ReRAM-based PIM (processing in memory) architecture. The work focuses on expressing the computing process in terms of GEMV and on the method for mapping it onto the ReRAM crossbar, and it realizes a software-hardware co-design to reduce power consumption and improve performance. Compared with GPU, the proposed reduction and scan algorithms achieve a substantial speedup of two orders of magnitude, and the average speedup also reaches two orders of magnitude. In the segmented case, the speedup reaches up to five (four on average) orders of magnitude. Meanwhile, the power consumption decreases by 79%. © 2022 National University of Defense Technology. All rights reserved.
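The abstract states that reduction and scan are expressed in terms of GEMV but does not give the formulation. The sketch below shows one common way to do this, assuming the textbook construction: a row of ones for reduction, a lower-triangular matrix of ones for inclusive scan, and a block-diagonal version of that matrix for the segmented case. It is not the authors' crossbar mapping. All function names are hypothetical, and `gemv` is a plain software stand-in for the single-step crossbar operation.

```python
def gemv(matrix, vector):
    """Matrix-vector product y = A x (software stand-in for one crossbar step)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def reduce_as_gemv(values):
    """Sum all inputs with a single GEMV against a 1 x n row of ones."""
    ones_row = [[1] * len(values)]
    return gemv(ones_row, values)[0]


def inclusive_scan_as_gemv(values):
    """Inclusive prefix sum: row i of the matrix selects inputs 0..i."""
    n = len(values)
    lower = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    return gemv(lower, values)


def segmented_scan_as_gemv(values, segment_lengths):
    """Segmented inclusive scan: row i selects inputs j <= i in the same segment."""
    # Assumption: segments are contiguous and given by their lengths.
    segment_ids = [s for s, length in enumerate(segment_lengths)
                   for _ in range(length)]
    assert len(segment_ids) == len(values), "segment lengths must cover the input"
    n = len(values)
    block_lower = [
        [1 if j <= i and segment_ids[i] == segment_ids[j] else 0 for j in range(n)]
        for i in range(n)
    ]
    return gemv(block_lower, values)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(reduce_as_gemv(data))                  # 15
    print(inclusive_scan_as_gemv(data))          # [1, 3, 6, 10, 15]
    print(segmented_scan_as_gemv(data, [2, 3]))  # [1, 3, 3, 7, 12]
```

In this formulation the 0/1 matrix depends only on the input length and segment layout, not on the data. That is why such a matrix could be programmed into a crossbar ahead of time and reused.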
Pages: 80-91
Number of pages: 11