Optimizing Deep Learning RNN Topologies on Intel Architecture

Cited by: 0
Authors
Banerjee K. [1 ]
Georganas E. [2 ]
Kalamkar D.D. [1 ]
Ziv B. [3 ]
Segal E. [3 ]
Anderson C. [4 ]
Heinecke A. [2 ]
Affiliations
[1] Intel Corporation, Bangalore
[2] Intel Corporation, Santa Clara
[3] Intel Corporation, Haifa
[4] Intel Corporation, Oregon
Keywords
Bandwidth-bound kernel; Compute-bound kernel; GEMM; Intel Xeon; LSTM
DOI
10.14529/jsfi190304
Abstract
Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for the GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implement a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are still hot in cache. Additionally, we bring the time step loop into our cell to further increase the weight reuse and to amortize the overhead of transforming the weights into a blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations; e.g., for the RNN cell, the forward pass is up to ~3x faster, whereas the backward/weight-update pass is up to ~5x faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they outperform the vectorized, reduced-accuracy, vendor-optimized (Intel SVML) library by 1.6-2.6x, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel's latest Cascade Lake architecture. © The Authors 2019.
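As a rough illustration of the fusion idea the abstract describes, the plain-C sketch below accumulates one output block of the GEMM completely and then applies the activation while that block is still hot in cache, instead of running a separate bandwidth-bound element-wise pass over the full output. The block sizes, the scalar loops, and the Pade-style rational tanh are illustrative assumptions, not the paper's actual kernel, which uses JIT-generated, vectorized microkernels and minimax polynomial approximations.

    /* Sketch: blocked GEMM with the element-wise activation fused in,
       applied per output block while it is hot in cache (assumed shapes
       and block sizes; not the paper's JIT-generated kernel). */
    #include <stddef.h>

    #define BM 64   /* rows of C per block (hidden-state block) */
    #define BN 16   /* cols of C per block (minibatch block)    */

    /* Cheap rational tanh approximation (assumption: the classic
       Pade-style formula x*(27+x^2)/(27+9*x^2), clamped to the range
       where it is reasonably accurate); the paper explores minimax,
       rational, Taylor, and exponential variants at several accuracies. */
    static inline float tanh_approx(float x)
    {
        if (x >  3.0f) x =  3.0f;
        if (x < -3.0f) x = -3.0f;
        float x2 = x * x;
        return x * (27.0f + x2) / (27.0f + 9.0f * x2);
    }

    /* C = tanh(W * X + bias), with the bias preloaded into C. Each
       (BM x BN) block of C is fully accumulated over K, then the
       activation runs on that block before moving on. */
    void fused_gemm_tanh(const float *W,  /* M x K, row-major */
                         const float *X,  /* K x N, row-major */
                         float *C,        /* M x N, row-major, bias on entry */
                         size_t M, size_t K, size_t N)
    {
        for (size_t mb = 0; mb < M; mb += BM) {
            for (size_t nb = 0; nb < N; nb += BN) {
                /* accumulate the full K dimension for this block */
                for (size_t k = 0; k < K; ++k)
                    for (size_t m = mb; m < mb + BM && m < M; ++m)
                        for (size_t n = nb; n < nb + BN && n < N; ++n)
                            C[m * N + n] += W[m * K + k] * X[k * N + n];
                /* fuse the activation on the still-cached block */
                for (size_t m = mb; m < mb + BM && m < M; ++m)
                    for (size_t n = nb; n < nb + BN && n < N; ++n)
                        C[m * N + n] = tanh_approx(C[m * N + n]);
            }
        }
    }

In an RNN cell this routine would be invoked once per time step with the same blocked weight matrix W, which is what makes hoisting the time step loop into the cell pay off: the blocked-layout transform of W is done once and its reuse grows with the sequence length.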
Pages: 64-85
Page count: 21