InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

Cited by: 1
Authors
Nagrecha, Kabir [1 ]
Liu, Lingyi [1 ]
Delgado, Pablo [1 ]
Padmanabhan, Prasanna [1 ]
Affiliations
[1] Netflix Inc, Los Gatos, CA 95032 USA
Keywords
data processing; recommendation systems; deep learning; parallel computing; resource allocation;
DOI
10.1145/3604915.3608778
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning (DL) training jobs are dominated by model execution times, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into the specific bottlenecks and challenges of the DLRM training pipeline at scale. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to both observe the performance impacts of online ingestion and to identify shortfalls in existing data pipeline optimizers. We find that current tooling either yields sub-optimal performance, suffers frequent crashes, or else requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data-loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves significantly higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus current state-of-the-art data pipeline optimizers while also improving both CPU and GPU utilization.
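The abstract describes InTune as an RL agent that redistributes a trainer machine's CPU cores across data-pipeline stages to raise end-to-end ingestion throughput. The paper's own implementation is not reproduced here; the following is only a minimal, hypothetical sketch of that general idea, using an epsilon-greedy search over per-stage core allocations as a stand-in for the RL agent. The stage names, per-worker rates, budget, and all function names are illustrative assumptions, not InTune's actual design.

```python
# Illustrative sketch only -- not the paper's implementation. It mimics the
# idea from the abstract: redistribute a fixed CPU budget across pipeline
# stages so that the slowest stage (the throughput bottleneck) speeds up.
import random

STAGES = ["read", "decode", "transform", "batch"]            # hypothetical stages
PER_WORKER_RATE = {"read": 900, "decode": 350, "transform": 500, "batch": 1200}
CPU_BUDGET = 16                                              # cores on the trainer machine


def throughput(alloc):
    """End-to-end samples/sec is limited by the slowest stage."""
    return min(PER_WORKER_RATE[s] * alloc[s] for s in STAGES)


def neighbors(alloc):
    """All allocations reachable by moving one core from one stage to another."""
    moves = []
    for src in STAGES:
        for dst in STAGES:
            if src != dst and alloc[src] > 1:
                new = dict(alloc)
                new[src] -= 1
                new[dst] += 1
                moves.append(new)
    return moves


def tune(steps=200, epsilon=0.2, seed=0):
    """Epsilon-greedy hill climbing over core allocations (a simple stand-in
    for the RL agent described in the abstract)."""
    random.seed(seed)
    alloc = {s: CPU_BUDGET // len(STAGES) for s in STAGES}   # start with an even split
    best = dict(alloc)
    for _ in range(steps):
        candidate = (random.choice(neighbors(alloc)) if random.random() < epsilon
                     else max(neighbors(alloc), key=throughput))
        if throughput(candidate) >= throughput(alloc):
            alloc = candidate
        if throughput(alloc) > throughput(best):
            best = dict(alloc)
    return best


if __name__ == "__main__":
    best = tune()
    print("allocation:", best, "-> samples/sec:", throughput(best))
```

In this toy setting the search shifts cores toward the "decode" stage, whose per-worker rate is lowest; the real system additionally has to cope with noisy, time-varying throughput measurements, which is where an RL formulation pays off.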
Pages: 430-442
Number of pages: 13