Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update

Cited by: 0
Authors
Sima, Chijun [1 ]
Fu, Yao [2 ]
Sit, Man-Kit [2 ]
Guo, Liyi [1 ]
Gong, Xuri [1 ]
Lin, Feng [1 ]
Wu, Junyu [1 ]
Li, Yongsheng [1 ]
Rong, Haidong [1 ]
Aublin, Pierre-Louis [3 ]
Mai, Luo [2 ]
Affiliations
[1] Tencent, Shenzhen, Peoples R China
[2] Univ Edinburgh, Edinburgh, Midlothian, Scotland
[3] IIJ Res Lab, Tokyo, Japan
Keywords
DOI
Not available
CLC number
TP3 [computing technology, computer technology];
Discipline code
0812;
Abstract
Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus promptly serving new users and content. Existing DLRSs, however, fail to do so. They train/validate models offline and broadcast entire models to global inference clusters, and thus incur significant model update latency (e.g. dozens of minutes), which adversely affects Service-Level Objectives (SLOs). This paper describes Ekko, a novel DLRS that enables low-latency model updates. Its design idea is to allow model updates to be immediately disseminated to all inference clusters, thus bypassing the long-latency model checkpointing, validation and broadcast steps. To realise this idea, we first design an efficient peer-to-peer model update dissemination algorithm. This algorithm exploits the sparsity and temporal locality in updating DLRS models to improve the throughput and reduce the latency of model updates. Further, Ekko has a model update scheduler that can prioritise, over busy networks, the sending of the model updates that most affect SLOs. Finally, Ekko has an inference model state manager which monitors the SLOs of inference models and rolls back the models if SLO-detrimental biased updates are detected. Evaluation results show that Ekko is orders of magnitude faster than state-of-the-art DLRS systems. Ekko has been deployed in production for more than one year; it serves over a billion users daily and reduces the model update latency, compared to state-of-the-art systems, from dozens of minutes to 2.4 seconds.
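As a reading aid only: the scheduler described above suggests a priority-queue style of SLO-aware update selection. The short Python sketch below illustrates that general idea under stated assumptions; the Update record, the impact() heuristic (change magnitude times row access frequency) and the schedule() function are hypothetical names invented for this illustration and are not taken from the paper.

# Hypothetical illustration (not Ekko's published algorithm): rank pending
# sparse embedding-row updates by an assumed "SLO impact" score so that the
# most significant updates are disseminated first when bandwidth is limited.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Update:
    priority: float                            # min-heap key: lower pops first
    key: int = field(compare=False)            # embedding-table row id
    delta_norm: float = field(compare=False)   # magnitude of the parameter change
    access_freq: float = field(compare=False)  # how often inference reads this row

def impact(delta_norm: float, access_freq: float) -> float:
    # Assumed heuristic: large changes to frequently read rows matter most.
    # Negated so the most impactful updates sit at the top of a min-heap.
    return -(delta_norm * access_freq)

def schedule(pending, budget):
    """Pick at most `budget` updates to send in this dissemination round."""
    heap = [Update(impact(u["delta_norm"], u["access_freq"]),
                   u["key"], u["delta_norm"], u["access_freq"])
            for u in pending]
    heapq.heapify(heap)
    return [heapq.heappop(heap).key for _ in range(min(budget, len(heap)))]

if __name__ == "__main__":
    pending = [
        {"key": 1, "delta_norm": 0.90, "access_freq": 100.0},  # hot row, big change
        {"key": 2, "delta_norm": 0.90, "access_freq": 1.0},    # cold row, big change
        {"key": 3, "delta_norm": 0.01, "access_freq": 500.0},  # hot row, tiny change
    ]
    print(schedule(pending, budget=2))  # -> [1, 3]

With the sample inputs, the sketch prints [1, 3]: the hot row with a large change and the very hot row with a small change are sent in the current round, while the cold row is deferred, mirroring the idea of prioritising the updates that most affect SLOs.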
Pages: 821-839
Page count: 19
Related papers
(50 in total)
  • [21] Deep Q-Learning for Low-Latency Tactile Applications: Microgrid Communications
    Elsayed, Medhat
    Erol-Kantarci, Melike
    2018 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS, CONTROL, AND COMPUTING TECHNOLOGIES FOR SMART GRIDS (SMARTGRIDCOMM), 2018,
  • [22] Constrained Deep Reinforcement Learning for Low-Latency Wireless VR Video Streaming
    Li, Shaoang
    She, Changyang
    Li, Yonghui
    Vucetic, Branka
    2021 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2021,
  • [23] Low-latency deep-reinforcement learning algorithm for ultrafast fiber lasers
    Yan, Qiuquan
    Deng, Qinghui
    Zhang, Jun
    Zhu, Ying
    Yin, Ke
    Li, Teng
    Wu, Dan
    Jiang, Tian
    PHOTONICS RESEARCH, 2021, (08): 1493 - 1501
  • [24] RecSysOps: Best Practices for Operating a Large-Scale Recommender System
    Saberian, Mohammad
    Basilico, Justin
    15TH ACM CONFERENCE ON RECOMMENDER SYSTEMS (RECSYS 2021), 2021, : 590 - 591
  • [25] Enabling low-latency digital twins for large-scale UAV networks using MQTT-based communication framework
    An, Dohyun
    Joo, Hyeontae
    Kim, Hwangnam
    ICT EXPRESS, 2025, 11 (02): 264 - 269
  • [26] A Low Latency Clustering Method for Large-Scale Drone Swarms
    Zhu, Xiaopan
    Bian, Chunjiang
    Chen, Yu
    Chen, Shi
    IEEE ACCESS, 2019, 7 : 186260 - 186267
  • [28] Tractable large-scale deep reinforcement learning
    Sarang, Nima
    Poullis, Charalambos
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 232
  • [29] Large-scale transport simulation by deep learning
    Pan, Jie
    NATURE COMPUTATIONAL SCIENCE, 2021, 1 (05): 306 - 306
  • [30] The three pillars of large-scale deep learning
    Hoefler, Torsten
    2021 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2021, : 908 - 908