Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update

Cited by: 0

Authors:

Sima, Chijun [1]

Fu, Yao [2]

Sit, Man-Kit [2]

Guo, Liyi [1]

Gong, Xuri [1]

Lin, Feng [1]

Wu, Junyu [1]

Li, Yongsheng [1]

Rong, Haidong [1]

Aublin, Pierre-Louis [3]

Mai, Luo [2]

Affiliations:

[1] Tencent, Shenzhen, People's Republic of China

[2] University of Edinburgh, Edinburgh, Midlothian, Scotland

[3] IIJ Research Laboratory, Tokyo, Japan

Source:

Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2022), 2022


DOI:

Not available

Chinese Library Classification:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus promptly serving new users and content. Existing DLRSs, however, fail to do so. They train and validate models offline and broadcast entire models to global inference clusters. They thus incur significant model update latency (e.g., dozens of minutes), which adversely affects Service-Level Objectives (SLOs). This paper describes Ekko, a novel DLRS that enables low-latency model updates. Its design idea is to allow model updates to be immediately disseminated to all inference clusters, thus bypassing long-latency model checkpoint, validation and broadcast. To realise this idea, we first design an efficient peer-to-peer model update dissemination algorithm. This algorithm exploits the sparsity and temporal locality in updating DLRS models to improve the throughput and latency of updating models. Further, Ekko has a model update scheduler that can prioritise, over busy networks, the sending of model updates that can largely affect SLOs. Finally, Ekko has an inference model state manager which monitors the SLOs of inference models and rolls back the models if SLO-detrimental biased updates are detected. Evaluation results show that Ekko is orders of magnitude faster than state-of-the-art DLRSs. Ekko has been deployed in production for more than one year, serves over a billion users daily and reduces the model update latency compared to state-of-the-art systems from dozens of minutes to 2.4 seconds.


Pages: 821-839

Number of pages: 19

