Optimizing deep learning RNN topologies on Intel architecture

Cited by: 0
Authors
Banerjee K. [1]
Georganas E. [2]
Kalamkar D.D. [1]
Ziv B. [3]
Segal E. [3]
Anderson C. [4]
Heinecke A. [2]
Affiliations
[1] Intel Corporation, Bangalore
[2] Intel Corporation, Santa Clara
[3] Intel Corporation, Haifa
[4] Intel Corporation, Oregon
Keywords
Bandwidth-bound kernel; Compute-bound kernel; GEMM; Intel Xeon; LSTM;
DOI
10.14529/jsfi190304
Abstract
Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for the GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are hot in cache. Additionally, we bring the time step loop into our cell to further increase the weight reuse and amortize the overhead of transforming the weights into a blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations, e.g., for the vanilla RNN cell the forward pass is up to ~3x faster, whereas the backward/weight-update pass is up to ~5x faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques.
Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they are able to outperform vectorized, reduced-accuracy vendor-optimized (Intel SVML) libraries by 1.6-2.6x, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel's latest Cascade Lake architecture. © The Authors 2019.
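The fusion idea described in the abstract — computing a blocked GEMM and applying the element-wise activation to each output block while it is still hot in cache — can be illustrated with a schematic NumPy sketch. Function names, block sizes, and the row/column blocking scheme here are illustrative assumptions; the paper's actual implementation uses JIT-generated, vectorized kernels operating on weights pre-transformed into a blocked layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_blocked_gemm_sigmoid(W, X, bm=64, bn=64):
    """Compute sigmoid(W @ X) block by block, fusing the element-wise
    operation into the GEMM loop instead of running it as a separate
    bandwidth-bound pass over the full result."""
    M, K = W.shape
    _, N = X.shape
    Y = np.empty((M, N))
    for i in range(0, M, bm):              # block over output rows
        Wi = W[i:i + bm, :]                # weight block, reused across all column blocks
        for j in range(0, N, bn):          # block over output columns (minibatch)
            acc = Wi @ X[:, j:j + bn]      # partial GEMM block
            # fuse the activation while the block is hot in cache
            Y[i:i + bm, j:j + bn] = sigmoid(acc)
    return Y
```

NumPy's slicing makes the remainder blocks (when `bm`/`bn` do not divide the matrix dimensions) fall out naturally; a real kernel would handle them with masked or peeled loops.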
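To give a flavor of the rational-polynomial technique mentioned above, here is a minimal sketch using the well-known [3/2] Padé approximant tanh(x) ≈ x(15 + x²)/(15 + 6x²). These are not the paper's coefficients: the paper's kernels use higher-degree minimax fits and vector intrinsics, trading polynomial degree for accuracy.

```python
import numpy as np

def tanh_pade(x):
    """Rational [3/2] Pade approximation of tanh.

    Accurate to roughly 1e-3 for |x| <= 1; illustrative only, since
    production kernels use higher-order minimax approximations."""
    x = np.asarray(x, dtype=np.float64)
    x2 = x * x
    y = x * (15.0 + x2) / (15.0 + 6.0 * x2)
    # The rational form grows like x/6 for large |x|, so enforce the
    # tanh output range explicitly.
    return np.clip(y, -1.0, 1.0)
```

Evaluating one such rational function costs a handful of multiplies, adds, and one divide per element, all of which vectorize cleanly, which is why this family of approximations can beat a full-accuracy libm tanh by a wide margin.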
Pages: 64-85 (21 pages)
Related papers (50 items)
  • [1] Optimizing for Intel Architecture CPUs
    Owen, JG
    DR DOBBS JOURNAL, 2004, 29 (07): 8 - 8
  • [2] RNN Architecture Learning with Sparse Regularization
    Dodge, Jesse
    Schwartz, Roy
    Peng, Hao
    Smith, Noah A.
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 1179 - 1184
  • [3] Deep Learning Architecture for Detecting SQL Injection Attacks Based on RNN Autoencoder Model
    Alghawazi, Maha
    Alghazzawi, Daniyal
    Alarifi, Suaad
    MATHEMATICS, 2023, 11 (15)
  • [4] Adaptation of RBM Learning for Intel MIC Architecture
    Olas, Tomasz
    Mleczko, Wojciech K.
    Nowicki, Robert K.
    Wyrzykowski, Roman
    Krzyzak, Adam
    ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, PT I, 2015, 9119 : 90 - 101
  • [5] Optimizing Image Classification: Automated Deep Learning Architecture Crafting with Network and Learning Hyperparameter Tuning
    Ang, Koon Meng
    Lim, Wei Hong
    Tiang, Sew Sun
    Sharma, Abhishek
    Eid, Marwa M.
    Tawfeek, Sayed M.
    Khafaga, Doaa Sami
    Alharbi, Amal H.
    Abdelhamid, Abdelaziz A.
    BIOMIMETICS, 2023, 8 (07)
  • [6] INFERENCE ACCELERATION OF DEEP LEARNING CLASSIFIERS BASED ON RNN
    Keddous, Fekhr Eddine
    Shvai, Nadiya
    Llanza, Arcadi
    Nakib, Amir
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2450 - 2454
  • [7] A Deep Learning-Based Novel RNN-BiLSTM Architecture for Efficient Threat Detection in IoT Ecosystem
    Chintale, Pradeep
    Naruka, Davinder
    Khanna, Anirudh
    Mandala, Vishwanadham
    Desaboyina, Gopi
    Sure, Tharun Anand Reddy
    ARTIFICIAL INTELLIGENCE AND KNOWLEDGE PROCESSING, AIKP 2024, 2025, 2228 : 198 - 212
  • [8] Optimizing Matrix Multiplication on Intel® Xeon Phi™ x200 Architecture
    Guney, Murat E.
    Goto, Kazushige
    Costa, Timothy B.
    Knepper, Sarah
    Huot, Louise
    Mitrano, Arthur A.
    Story, Shane
    2017 IEEE 24TH SYMPOSIUM ON COMPUTER ARITHMETIC (ARITH), 2017, : 144 - 145
  • [9] Deep Learning for Household Load Forecasting-A Novel Pooling Deep RNN
    Shi, Heng
    Xu, Minghao
    Li, Ran
    IEEE TRANSACTIONS ON SMART GRID, 2018, 9 (05) : 5271 - 5280
  • [10] The Potential of the Intel® Xeon Phi™ for Supervised Deep Learning
    Viebke, Andre
    Pllana, Sabri
    2015 IEEE 17TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, 2015 IEEE 7TH INTERNATIONAL SYMPOSIUM ON CYBERSPACE SAFETY AND SECURITY, AND 2015 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS (ICESS), 2015, : 758 - 765