Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update

Cited by: 0
Authors
Sima, Chijun [1]
Fu, Yao [2]
Sit, Man-Kit [2]
Guo, Liyi [1]
Gong, Xuri [1]
Lin, Feng [1]
Wu, Junyu [1]
Li, Yongsheng [1]
Rong, Haidong [1]
Aublin, Pierre-Louis [3]
Mai, Luo [2]
Affiliations
[1] Tencent, Shenzhen, People's Republic of China
[2] University of Edinburgh, Edinburgh, Midlothian, Scotland
[3] IIJ Research Laboratory, Tokyo, Japan
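Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract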
Deep Learning Recommender Systems (DLRSs) need to update models at low latency so that they can promptly serve new users and content. Existing DLRSs, however, fail to do so. They train and validate models offline and broadcast entire models to global inference clusters. They thus incur significant model update latency (e.g., dozens of minutes), which adversely affects Service-Level Objectives (SLOs). This paper describes Ekko, a novel DLRS that enables low-latency model updates. Its design idea is to allow model updates to be immediately disseminated to all inference clusters, thus bypassing long-latency model checkpoint, validation and broadcast. To realise this idea, we first design an efficient peer-to-peer model update dissemination algorithm. This algorithm exploits the sparsity and temporal locality in updating DLRS models to improve the throughput and latency of updating models. Further, Ekko has a model update scheduler that can prioritise, over busy networks, the sending of model updates that can largely affect SLOs. Finally, Ekko has an inference model state manager which monitors the SLOs of inference models and rolls back the models if SLO-detrimental biased updates are detected. Evaluation results show that Ekko is orders of magnitude faster than state-of-the-art DLRSs. Ekko has been deployed in production for more than one year, serves over a billion users daily, and reduces the model update latency compared to state-of-the-art systems from dozens of minutes to 2.4 seconds.
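The abstract gives no implementation detail, so the following is only a minimal illustrative sketch of the general idea of sending sparse, versioned parameter updates in priority order. It is not Ekko's algorithm. The names `Update`, `UpdateScheduler`, `Replica`, `priority` and `budget` are all hypothetical, and the priority score is assumed to be supplied by the caller.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Update:
    """One sparse update: a single parameter key with a new version and value."""
    sort_key: float                        # negated priority, so heapq pops the highest priority first
    key: int = field(compare=False)        # hypothetical parameter (e.g. embedding row) id
    version: int = field(compare=False)    # monotonically increasing per key
    value: list = field(compare=False)


class UpdateScheduler:
    """Hypothetical scheduler: queues updates and releases the most important ones first."""

    def __init__(self):
        self._heap = []
        self._latest = {}  # key -> newest version submitted so far

    def submit(self, key, version, value, priority):
        # Ignore updates that are not newer than what is already queued for this key.
        if version <= self._latest.get(key, -1):
            return
        self._latest[key] = version
        heapq.heappush(self._heap, Update(-priority, key, version, value))

    def next_batch(self, budget):
        """Return up to `budget` updates, skipping entries superseded by a newer version."""
        batch = []
        while self._heap and len(batch) < budget:
            update = heapq.heappop(self._heap)
            if update.version == self._latest[update.key]:
                batch.append(update)
        return batch


class Replica:
    """Hypothetical inference-side store that applies only newer versions."""

    def __init__(self):
        self.params = {}  # key -> (version, value)

    def apply(self, updates):
        for update in updates:
            current = self.params.get(update.key)
            if current is None or update.version > current[0]:
                self.params[update.key] = (update.version, update.value)


scheduler = UpdateScheduler()
scheduler.submit(key=1, version=1, value=[0.1], priority=0.2)
scheduler.submit(key=2, version=1, value=[0.5], priority=0.9)
scheduler.submit(key=1, version=2, value=[0.3], priority=0.4)  # supersedes key 1, version 1

batch = scheduler.next_batch(budget=2)
replica = Replica()
replica.apply(batch)
```

Because only changed keys are queued and stale versions are dropped before sending, the traffic in this sketch scales with the number of recently modified parameters rather than with the full model size. That is the property the abstract attributes to exploiting sparsity and temporal locality.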
Pages: 821-839
Number of pages: 19
Related papers
50 in total
  • [31] Large-scale Pollen Recognition with Deep Learning
    de Geus, Andre R.
    Barcelos, Celia A. Z.
    Batista, Marcos A.
    da Silva, Sergio F.
    2019 27TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2019,
  • [32] Learning Deep Representation with Large-scale Attributes
    Ouyang, Wanli
    Li, Hongyang
    Zeng, Xingyu
    Wang, Xiaogang
    2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 1895 - 1903
  • [33] Deep Learning on Large-scale Muticore Clusters
    Sakiyama, Kazumasa
    Kato, Shinpei
    Ishikawa, Yutaka
    Hori, Atsushi
    Monrroy, Abraham
    2018 30TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD 2018), 2018, : 314 - 321
  • [34] Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems
    Zhou, Chang
    Ma, Jianxin
    Zhang, Jianwei
    Zhou, Jingren
    Yang, Hongxia
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 3985 - 3995
  • [35] A Low-Latency and High-Performance Microwave Photonic AOA and IFM System Based on Deep Learning and FPGA
    Zhang, Longlong
    Li, Yin
    Liao, Xuan
    Hu, Xiang
    Peng, Yuanxi
    Zhou, Tong
    IEEE SENSORS JOURNAL, 2025, 25 (06) : 9934 - 9945
  • [36] Low-Latency Energy-Efficient Cyber-Physical Disaster System Using Edge Deep Learning
    Patel, Yashwant Singh
    Banerjee, Sourasekhar
    Misra, Rajiv
    Das, Sajal K.
    PROCEEDINGS OF THE 21ST INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING AND NETWORKING (ICDCN 2020), 2020,
  • [37] Hybrid recommender system based on deep learning model
    Su C.
    Huang D.
    International Journal of Performability Engineering, 2020, 16 (01) : 118 - 129
  • [38] Low-latency Virtual Network function Scheduling Algorithm Based on Deep Reinforcement Learning
    Liu, Zhiwei
    Shu, Zhaogang
    Chen, Shuwu
    Zhong, Yiwen
    Lin, Jiaxiang
    COMPUTER NETWORKS, 2024, 246
  • [39] Fragility Risks of Low Latency Dynamic Queuing in Large-Scale Clouds: Complex System Perspective
    Marbukh, V.
    2017 IFIP NETWORKING CONFERENCE (IFIP NETWORKING) AND WORKSHOPS, 2017,
  • [40] Hybrid Deep Learning Ensemble Model for Improved Large-Scale Car Recognition
    Verma, Abhishek
    Liu, Yu
    2017 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTED, SCALABLE COMPUTING & COMMUNICATIONS, CLOUD & BIG DATA COMPUTING, INTERNET OF PEOPLE AND SMART CITY INNOVATION (SMARTWORLD/SCALCOM/UIC/ATC/CBDCOM/IOP/SCI), 2017,