Optimizing deep learning RNN topologies on Intel architecture

Cited by: 0
Authors
Banerjee K. [1 ]
Georganas E. [2 ]
Kalamkar D.D. [1 ]
Ziv B. [3 ]
Segal E. [3 ]
Anderson C. [4 ]
Heinecke A. [2 ]
Affiliations
[1] Intel Corporation, Bangalore
[2] Intel Corporation, Santa Clara
[3] Intel Corporation, Haifa
[4] Intel Corporation, Oregon
Keywords
Bandwidth-bound kernel; Compute-bound kernel; GEMM; Intel Xeon; LSTM
DOI
10.14529/jsfi190304
Abstract
Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, data reuse depends on how the GEMMs are parallelized and is sub-optimal for the GEMM sizes stemming from small minibatches. Moreover, the element-wise operations are exposed as a bandwidth-bound kernel that follows the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize reuse of the weight matrices, and (c) fuse the element-wise operations onto partial GEMM blocks while they are still hot in cache. Additionally, we bring the time-step loop into our cell to further increase weight reuse and to amortize the overhead of transforming the weights into a blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations; e.g., for the vanilla RNN, the forward pass is up to ~3x faster while the backward/weight-update pass is up to ~5x faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they outperform the vectorized, reduced-accuracy vendor-optimized library (Intel SVML) by 1.6-2.6x, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel's latest Cascade Lake architecture. © The Authors 2019.
Pages: 64-85
Page count: 21
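
The fused, blocked GEMM described in the abstract can be illustrated with a short sketch. The C code below is not the paper's implementation: the function name, the block sizes BM/BN, and the plain triple loop are illustrative stand-ins, whereas the real kernels use JIT-generated microkernels, a blocked weight layout, and OpenMP parallelization. It computes C = sigmoid(A * B) one output block at a time and applies the sigmoid to each block right after it is produced, while the block is still hot in cache, instead of making a second bandwidth-bound pass over the whole output.

```c
#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical block sizes; real kernels choose them to fit caches/registers. */
#define BM 2
#define BN 2

static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

/* C = sigmoid(A * B); A is MxK, B is KxN, C is MxN, all row-major. */
static void gemm_fused_sigmoid(const float *A, const float *B, float *C,
                               size_t M, size_t N, size_t K)
{
    for (size_t ib = 0; ib < M; ib += BM) {
        size_t mend = (ib + BM < M) ? ib + BM : M;
        for (size_t jb = 0; jb < N; jb += BN) {
            size_t nend = (jb + BN < N) ? jb + BN : N;
            /* Step 1: compute one output block over the full K dimension. */
            for (size_t i = ib; i < mend; i++)
                for (size_t j = jb; j < nend; j++) {
                    float acc = 0.0f;
                    for (size_t k = 0; k < K; k++)
                        acc += A[i * K + k] * B[k * N + j];
                    C[i * N + j] = acc;
                }
            /* Step 2: apply the element-wise op to the block while it is hot
             * in cache, avoiding a separate bandwidth-bound sweep over C. */
            for (size_t i = ib; i < mend; i++)
                for (size_t j = jb; j < nend; j++)
                    C[i * N + j] = sigmoidf(C[i * N + j]);
        }
    }
}

int main(void)
{
    const float A[2 * 3] = { 1, 2, 3, 4, 5, 6 };                      /* 2x3 */
    const float B[3 * 2] = { 0.1f, -0.2f, 0.3f, 0.4f, -0.5f, 0.6f };  /* 3x2 */
    float C[2 * 2];
    gemm_fused_sigmoid(A, B, C, 2, 2, 3);
    printf("%f %f\n%f %f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```

In the paper's cells the time-step loop additionally sits inside the blocked structure, so the blocked weights are reused across time steps and the cost of transforming them into the blocked layout is amortized.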
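
The activation-function investigation can be sketched in the same spirit. The paper evaluates minimax polynomials, rational polynomials, Taylor expansions and exponential tricks at several accuracy levels; as an illustrative stand-in (not one of the paper's kernels, which are vectorized for AVX-512), the scalar function below approximates tanh with a Padé-style rational function obtained by truncating the continued fraction tanh(x) = x/(1 + x²/(3 + x²/(5 + ...))), which yields roughly 1e-3 absolute error.

```c
#include <math.h>
#include <stdio.h>

/* tanh(x) ~= x * (945 + 105*x^2 + x^4) / (945 + 420*x^2 + 15*x^4),
 * a Pade-style rational function from truncating the continued fraction
 * tanh(x) = x / (1 + x^2/(3 + x^2/(5 + x^2/(7 + x^2/9)))).
 * Absolute error is on the order of 1e-3. This is NOT the paper's kernel,
 * which uses vectorized minimax/exponential approximations instead. */
static float fast_tanhf(float x)
{
    /* Clamp the input so x^4 cannot overflow; tanh is saturated out here. */
    x = fminf(fmaxf(x, -9.0f), 9.0f);
    float x2 = x * x;
    float num = x * (945.0f + x2 * (105.0f + x2));
    float den = 945.0f + x2 * (420.0f + 15.0f * x2);
    float r = num / den;
    /* The rational form slightly overshoots |tanh| = 1 beyond |x| ~ 3.9. */
    return fminf(fmaxf(r, -1.0f), 1.0f);
}

int main(void)
{
    /* Compare against libm tanhf at a few points. */
    for (float x = -6.0f; x <= 6.0f; x += 1.5f)
        printf("x = % .1f  fast = % .6f  libm = % .6f\n",
               x, fast_tanhf(x), tanhf(x));
    return 0;
}
```

Since sigmoid(x) = 0.5 * (1 + tanh(x/2)), one such tanh routine can also serve the sigmoid gates of LSTM and GRU cells.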