TCX: A Programmable Tensor Processor

Authors
Liang, Tailin [1 ,2 ]
Wang, Lei [1 ]
Shi, Shaobo [1 ,2 ]
Glossner, John [1 ,3 ]
Zhang, Xiaotong [1 ]
Affiliations
[1] Univ Sci & Technol, Sch Comp Sci & Commun Engn, Beijing 100083, Peoples R China
[2] Hua Xia Gen Processor Technol, Beijing 100080, Peoples R China
[3] Gen Processor Technol, Tarrytown, NY 10591 USA
Funding
National Key Research and Development Program of China;
Keywords
Neural Network Accelerator; Convolutional Neural Network; ASIC Design; EFFICIENT; ACCELERATOR;
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational requirements of deep learning algorithms. This paper proposes TCX, a new instruction set extension for tensor computing with RISC-style instructions and variable-length tensor extensions. It features a multidimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC ISAs and provides software compatibility across scalable hardware implementations. We present an implementation of the TCX tensor computing accelerator based on an out-of-order microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register renaming mechanism allows for many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements based on tensor dimensions. Implementations may balance data bandwidth and computation utilization for different types of tensor computations, such as element-wise, depth-wise, and matrix-multiplication operations. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) using a compute unit of 4096 multiply-accumulate (MAC) units with up to 98.83% MAC utilization. It occupies 12.8 square millimeters and dissipates 0.46 watts per TOPS in TSMC 28 nm technology.
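The headline figures in the abstract can be cross-checked with a short back-of-envelope calculation. The sketch below is not taken from the paper: it assumes that one MAC counts as two operations (a common convention, not stated in the abstract) and that the per-TOPS power figure applies at peak throughput. All names in it are illustrative.

```python
# Back-of-envelope check of the figures quoted in the abstract.
# Assumptions (NOT stated in the abstract):
#   - one multiply-accumulate counts as 2 operations
#   - the 0.46 W/TOPS figure applies at peak throughput

MAC_UNITS = 4096          # from the abstract
CLOCK_HZ = 1e9            # 1 GHz, from the abstract
OPS_PER_MAC = 2           # assumption: multiply + add
MAX_UTILIZATION = 0.9883  # "up to 98.83%", from the abstract
WATTS_PER_TOPS = 0.46     # from the abstract
AREA_MM2 = 12.8           # from the abstract


def peak_tops(mac_units, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak throughput in tera-operations per second."""
    return mac_units * clock_hz * ops_per_mac / 1e12


peak = peak_tops(MAC_UNITS, CLOCK_HZ)
best_case = peak * MAX_UTILIZATION    # throughput at the best reported utilization
est_power_w = WATTS_PER_TOPS * peak   # rough total power under the assumption above
density = peak / AREA_MM2             # TOPS per square millimeter

print(f"peak throughput      : {peak:.3f} TOPS")
print(f"at 98.83% utilization: {best_case:.3f} TOPS")
print(f"estimated power      : {est_power_w:.2f} W")
print(f"compute density      : {density:.2f} TOPS/mm^2")
```

Under these assumptions the peak works out to 8.192 TOPS, which is consistent with the 8.2 TOPS quoted in the abstract. The power and density values are estimates that depend on the stated assumptions, not results reported by the authors.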
Pages: 1023-1028
Page count: 6