CoExE: An Efficient Co-execution Architecture for Real-Time Neural Network Services

Times cited: 2
Authors
Liu, Chubo [1]
Li, Kenli [1]
Song, Mingcong [2]
Zhao, Jiechen [2]
Li, Keqin [3]
Li, Tao [2]
Zeng, Zihao [1]
Affiliations
[1] Hunan Univ, Coll Informat Sci & Engn, Changsha, Peoples R China
[2] Univ Florida, Dept Elect & Comp Engn, Gainesville, FL 32611 USA
[3] SUNY Coll New Paltz, Dept Comp Sci, New Paltz, NY USA
Keywords
AI-cloud acceleration; Co-execution architecture; Real-time NN service; Sparsity-driven multi-context
DOI
10.1109/dac18072.2020.9218740
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
End-to-end latency is critical for user-interactive neural network (NN) services in the cloud. During periods of high request load, co-locating multiple NN requests has the potential to reduce end-to-end latency. However, current batch-based accelerators lack support for request-level parallelism, which leaves queuing time unoptimized. Meanwhile, naively partitioning resources among simultaneous requests leads to longer execution time and lower resource efficiency, because each application uses its own separate resources without sharing them. To effectively reduce the end-to-end latency of real-time NN requests, we propose the CoExE architecture, which provides a pipelined implementation of a sparsity-driven real-time co-execution model. By exploiting the non-trivial amount of sparse operations that arise during concurrent NN execution, CoExE reduces end-to-end latency by up to 12.3x and 2.4x compared with an Eyeriss-like accelerator and SCNN, respectively, in peak-workload mode. In addition, we propose a row cross (RC) dataflow that reduces data movement cost and avoids memory duplication.
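The abstract's argument that static resource partitioning wastes hardware, while sharing resources across concurrent requests can exploit sparsity, can be illustrated with a toy cost model. The sketch below is not CoExE's co-execution model, scheduler, or RC dataflow. It assumes hypothetical request sizes, a hypothetical 256-PE array, one non-zero MAC per PE per cycle, perfect zero-skipping, and perfect load balancing, and it ignores data movement and pipeline effects. The names `Request`, `partitioned_latencies`, `shared_makespan`, and `utilization` are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One NN inference request in a toy, hypothetical cost model."""
    name: str
    dense_macs: int    # MACs needed if no operand were zero
    nonzero_macs: int  # MACs left after zero-skipping (assumed perfect)


def partitioned_latencies(reqs: list[Request], num_pes: int) -> dict[str, int]:
    """Naive static partitioning: each request owns an equal PE slice, no sharing."""
    share = num_pes // len(reqs)
    # Assumes one non-zero MAC per PE per cycle; PEs idle in one slice
    # cannot help a slower request running in another slice.
    return {r.name: math.ceil(r.nonzero_macs / share) for r in reqs}


def shared_makespan(reqs: list[Request], num_pes: int) -> int:
    """Idealized sharing: non-zero work from all requests fills one PE pool."""
    total = sum(r.nonzero_macs for r in reqs)
    return math.ceil(total / num_pes)


def utilization(reqs: list[Request], num_pes: int, makespan: int) -> float:
    """Fraction of PE-cycles doing non-zero work until the last request ends."""
    return sum(r.nonzero_macs for r in reqs) / (num_pes * makespan)


NUM_PES = 256  # hypothetical array size
REQUESTS = [
    Request("A", dense_macs=1_000_000, nonzero_macs=300_000),  # highly sparse
    Request("B", dense_macs=1_000_000, nonzero_macs=800_000),  # mostly dense
]

if __name__ == "__main__":
    for r in REQUESTS:
        print(f"{r.name}: sparsity {1 - r.nonzero_macs / r.dense_macs:.0%}")
    part = partitioned_latencies(REQUESTS, NUM_PES)
    part_span = max(part.values())
    co_span = shared_makespan(REQUESTS, NUM_PES)
    print("partitioned makespan:", part_span)  # 6250 cycles
    print("shared makespan:", co_span)         # 4297 cycles
    print(f"utilization: {utilization(REQUESTS, NUM_PES, part_span):.1%} vs "
          f"{utilization(REQUESTS, NUM_PES, co_span):.1%}")
```

With these hypothetical numbers, the mostly dense request B dominates the partitioned makespan while A's half of the array sits idle after A finishes; pooling the non-zero work shortens the makespan and raises utilization. Per-request latency under sharing depends on the scheduling policy, which this model deliberately leaves out.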
Pages: 6