Optimizing deep learning RNN topologies on Intel architecture

Cited by: 0
Authors
Banerjee K. [1]
Georganas E. [2]
Kalamkar D.D. [1]
Ziv B. [3]
Segal E. [3]
Anderson C. [4]
Heinecke A. [2]
Affiliations
[1] Intel Corporation, Bangalore
[2] Intel Corporation, Santa Clara
[3] Intel Corporation, Haifa
[4] Intel Corporation, Oregon
Keywords
Bandwidth-bound kernel; Compute-bound kernel; GEMM; Intel Xeon; LSTM;
DOI
10.14529/jsfi190304
Abstract
Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for the GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are hot in cache. Additionally, we bring the time step loop into our cell to further increase the weight reuse and amortize the overhead of transforming the weights into a blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations, e.g., for the vanilla RNN cell the forward pass is up to ~3x faster, whereas the backward/weight-update pass is up to ~5x faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques.
Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they are able to outperform vectorized, reduced-accuracy vendor-optimized (Intel SVML) libraries by 1.6-2.6x, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel's latest Cascade Lake architecture. © The Authors 2019.
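The fusion idea described in the abstract — computing a blocked GEMM and applying the element-wise activation to each output block while it is still hot in cache — can be illustrated with a schematic NumPy sketch. Function names, block sizes, and the row/column blocking scheme here are illustrative assumptions; the paper's actual implementation uses JIT-generated, vectorized kernels operating on weights pre-transformed into a blocked layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_blocked_gemm_sigmoid(W, X, bm=64, bn=64):
    """Compute sigmoid(W @ X) block by block, fusing the element-wise
    operation into the GEMM loop instead of running it as a separate
    bandwidth-bound pass over the full result."""
    M, K = W.shape
    _, N = X.shape
    Y = np.empty((M, N))
    for i in range(0, M, bm):              # block over output rows
        Wi = W[i:i + bm, :]                # weight block, reused across all column blocks
        for j in range(0, N, bn):          # block over output columns (minibatch)
            acc = Wi @ X[:, j:j + bn]      # partial GEMM block
            # fuse the activation while the block is hot in cache
            Y[i:i + bm, j:j + bn] = sigmoid(acc)
    return Y
```

NumPy's slicing makes the remainder blocks (when `bm`/`bn` do not divide the matrix dimensions) fall out naturally; a real kernel would handle them with masked or peeled loops.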
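To give a flavor of the rational-polynomial technique mentioned above, here is a minimal sketch using the well-known [3/2] Padé approximant tanh(x) ≈ x(15 + x²)/(15 + 6x²). These are not the paper's coefficients: the paper's kernels use higher-degree minimax fits and vector intrinsics, trading polynomial degree for accuracy.

```python
import numpy as np

def tanh_pade(x):
    """Rational [3/2] Pade approximation of tanh.

    Accurate to roughly 1e-3 for |x| <= 1; illustrative only, since
    production kernels use higher-order minimax approximations."""
    x = np.asarray(x, dtype=np.float64)
    x2 = x * x
    y = x * (15.0 + x2) / (15.0 + 6.0 * x2)
    # The rational form grows like x/6 for large |x|, so enforce the
    # tanh output range explicitly.
    return np.clip(y, -1.0, 1.0)
```

Evaluating one such rational function costs a handful of multiplies, adds, and one divide per element, all of which vectorize cleanly, which is why this family of approximations can beat a full-accuracy libm tanh by a wide margin.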
Pages: 64-85 (21 pages)
Related papers (50 items)
  • [1] Optimizing for Intel Architecture CPUs
    Owen, JG
    DR DOBBS JOURNAL, 2004, 29 (07): 8 - 8
  • [2] RNN Architecture Learning with Sparse Regularization
    Dodge, Jesse
    Schwartz, Roy
    Peng, Hao
    Smith, Noah A.
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 1179 - 1184
  • [3] Deep Learning Architecture for Detecting SQL Injection Attacks Based on RNN Autoencoder Model
    Alghawazi, Maha
    Alghazzawi, Daniyal
    Alarifi, Suaad
    MATHEMATICS, 2023, 11 (15)
  • [4] Adaptation of RBM Learning for Intel MIC Architecture
    Olas, Tomasz
    Mleczko, Wojciech K.
    Nowicki, Robert K.
    Wyrzykowski, Roman
    Krzyzak, Adam
    ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, PT I, 2015, 9119 : 90 - 101
  • [5] Optimizing Image Classification: Automated Deep Learning Architecture Crafting with Network and Learning Hyperparameter Tuning
    Ang, Koon Meng
    Lim, Wei Hong
    Tiang, Sew Sun
    Sharma, Abhishek
    Eid, Marwa M.
    Tawfeek, Sayed M.
    Khafaga, Doaa Sami
    Alharbi, Amal H.
    Abdelhamid, Abdelaziz A.
    BIOMIMETICS, 2023, 8 (07)
  • [6] INFERENCE ACCELERATION OF DEEP LEARNING CLASSIFIERS BASED ON RNN
    Keddous, Fekhr Eddine
    Shvai, Nadiya
    Llanza, Arcadi
    Nakib, Amir
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2450 - 2454
  • [7] A Deep Learning-Based Novel RNN-BiLSTM Architecture for Efficient Threat Detection in IoT Ecosystem
    Chintale, Pradeep
    Naruka, Davinder
    Khanna, Anirudh
    Mandala, Vishwanadham
    Desaboyina, Gopi
    Sure, Tharun Anand Reddy
    ARTIFICIAL INTELLIGENCE AND KNOWLEDGE PROCESSING, AIKP 2024, 2025, 2228 : 198 - 212
  • [8] Optimizing Matrix Multiplication on Intel® Xeon Phi™ x200 Architecture
    Guney, Murat E.
    Goto, Kazushige
    Costa, Timothy B.
    Knepper, Sarah
    Huot, Louise
    Mitrano, Arthur A.
    Story, Shane
    2017 IEEE 24TH SYMPOSIUM ON COMPUTER ARITHMETIC (ARITH), 2017, : 144 - 145
  • [9] Deep Learning for Household Load Forecasting-A Novel Pooling Deep RNN
    Shi, Heng
    Xu, Minghao
    Li, Ran
    IEEE TRANSACTIONS ON SMART GRID, 2018, 9 (05) : 5271 - 5280
  • [10] The Potential of the Intel® Xeon Phi™ for Supervised Deep Learning
    Viebke, Andre
    Pllana, Sabri
    2015 IEEE 17TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, 2015 IEEE 7TH INTERNATIONAL SYMPOSIUM ON CYBERSPACE SAFETY AND SECURITY, AND 2015 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS (ICESS), 2015, : 758 - 765