TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function-as-a-Service

Cited by: 14
Authors
Dakkak, Abdul [1 ]
Li, Cheng [1 ]
de Gonzalo, Simon Garcia [1 ]
Xiong, Jinjun [2 ]
Hwu, Wen-mei [3 ]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] IBM Thomas J Watson Res Ctr, Yorktown Hts, NY USA
[3] Univ Illinois, Dept Elect & Comp Engn, Champaign, IL USA
DOI
10.1109/CLOUD.2019.00067
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology];
Subject Classification Code
0812;
Abstract
Deep neural networks (DNNs) have become core computation components within low latency Function as a Service (FaaS) prediction pipelines. Cloud computing, as the de facto backbone of modern computing infrastructure, has to be able to handle user-defined FaaS pipelines containing diverse DNN inference workloads while maintaining isolation and latency guarantees with minimal resource waste. The current solution for guaranteeing isolation and latency within FaaS is inefficient. A major cause of the inefficiency is the need to move large amounts of data within and across servers. We propose TrIMS as a novel solution to address this issue. TrIMS is a generic memory sharing technique that enables constant data to be shared across processes or containers while still maintaining isolation between users. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy; an efficient resource management layer that provides isolation; and a succinct set of abstractions, application APIs, and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework, achieving up to 24x speedup in latency for image classification models, up to 210x speedup for large models, and up to 8x improvement in system throughput.
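The abstract describes a persistent model store that walks a GPU / CPU / local-storage / cloud hierarchy and hands callers a shared, read-only copy of constant model weights instead of reloading them per process. The sketch below illustrates that lookup pattern only, under stated assumptions; every name in it (ModelStore, ModelEntry, share_handle, the tier names) is hypothetical and is not the TrIMS or Apache MXNet API.

# Minimal, hypothetical sketch of the model-store lookup pattern described in
# the abstract; names and behavior are illustrative assumptions, not the TrIMS API.
from dataclasses import dataclass, field
from typing import Dict

# Memory/storage hierarchy named in the abstract: GPU, CPU, local storage, cloud.
LEVELS = ("gpu", "cpu", "local_disk", "cloud")

@dataclass
class ModelEntry:
    name: str
    level: str          # tier that currently holds the constant (read-only) weights
    share_handle: int   # opaque handle a client process would map read-only

@dataclass
class ModelStore:
    """Persistent store of constant model weights, shared across processes while
    each user's mutable state stays private, preserving isolation."""
    entries: Dict[str, ModelEntry] = field(default_factory=dict)
    next_handle: int = 0

    def open(self, name: str) -> ModelEntry:
        # Hit: hand back the existing shared handle with no data movement,
        # which is where the latency savings reported in the abstract come from.
        if name in self.entries:
            return self.entries[name]
        # Miss: load once through the hierarchy and publish the entry for sharing.
        entry = self._load_through_hierarchy(name)
        self.entries[name] = entry
        return entry

    def _load_through_hierarchy(self, name: str) -> ModelEntry:
        # Placeholder for copying weights upward (cloud -> local disk -> CPU -> GPU)
        # and pinning them for reuse; here we just mint a fresh shared handle.
        self.next_handle += 1
        return ModelEntry(name=name, level=LEVELS[0], share_handle=self.next_handle)

if __name__ == "__main__":
    store = ModelStore()
    cold = store.open("resnet50")   # first request: loaded through the hierarchy
    warm = store.open("resnet50")   # later requests: reuse the shared handle
    assert cold.share_handle == warm.share_handle
    print(f"'{warm.name}' served from tier '{warm.level}' via handle {warm.share_handle}")

In this pattern the store process owns the constant weights and clients only receive a handle to map, so duplicate copies of the same model are avoided while per-user state remains separate.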
Pages: 372-382
Page count: 11