CoExE: An Efficient Co-execution Architecture for Real-Time Neural Network Services

Cited by: 2

Authors:

Liu, Chubo [1]

Li, Kenli [1]

Song, Mingcong [2]

Zhao, Jiechen [2]

Li, Keqin [3]

Li, Tao [2]

Zeng, Zihao [1]

Affiliations:

[1] Hunan Univ, Coll Informat Sci & Engn, Changsha, Peoples R China

[2] Univ Florida, Dept Elect & Comp Engn, Gainesville, FL 32611 USA

[3] SUNY Coll New Paltz, Dept Comp Sci, New Paltz, NY USA

Source:

PROCEEDINGS OF THE 2020 57TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC) | 2020

Keywords:

AI-cloud acceleration; Co-execution architecture; Real-time NN service; Sparsity-driven multi-context

DOI:

10.1109/dac18072.2020.9218740

Chinese Library Classification (CLC) number:

TP31 [Computer software]

Discipline classification code:

081202; 0835

Abstract:

End-to-end latency is critical for user-interactive neural network (NN) services in the cloud. During periods of high request load, co-locating multiple NN requests has the potential to reduce end-to-end latency. However, current batch-based accelerators lack request-level parallelism support, leaving queuing time unoptimized. Meanwhile, naively partitioning resources among simultaneous requests leads to longer execution time and lower resource efficiency, because different applications use separate resources without sharing. To effectively reduce the end-to-end latency of real-time NN requests, we propose the CoExE architecture, equipped with a pipeline implementation of a sparsity-driven real-time co-execution model. By leveraging the non-trivial amount of sparse operations during concurrent NN execution, end-to-end latency is reduced by up to 12.3x and 2.4x over Eyeriss-like and SCNN accelerators, respectively, at peak workload. In addition, we propose a row cross (RC) dataflow to reduce data movement cost and avoid memory duplication.


Pages: 6

