InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

Cited by: 1
Authors
Nagrecha, Kabir [1 ]
Liu, Lingyi [1 ]
Delgado, Pablo [1 ]
Padmanabhan, Prasanna [1 ]
Affiliations
[1] Netflix Inc, Los Gatos, CA 95032 USA
Keywords
data processing; recommendation systems; deep learning; parallel computing; resource allocation;
DOI
10.1145/3604915.3608778
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning (DL) training jobs are dominated by model execution times, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into the specific bottlenecks and challenges of the DLRM training pipeline at scale. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to both observe the performance impacts of online ingestion and to identify shortfalls in existing data pipeline optimizers. We find that current tooling either yields sub-optimal performance, suffers frequent crashes, or else requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data-loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves significantly higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus current state-of-the-art data pipeline optimizers while also improving both CPU and GPU utilization.
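The abstract describes InTune as an RL agent that redistributes a trainer machine's CPU cores across data-pipeline stages to raise end-to-end ingestion throughput. The paper's own implementation is not reproduced here; the following is only a minimal, hypothetical sketch of that general idea, using an epsilon-greedy search over per-stage core allocations as a stand-in for the RL agent. The stage names, per-worker rates, budget, and all function names are illustrative assumptions, not InTune's actual design.

```python
# Illustrative sketch only -- not the paper's implementation. It mimics the
# idea from the abstract: redistribute a fixed CPU budget across pipeline
# stages so that the slowest stage (the throughput bottleneck) speeds up.
import random

STAGES = ["read", "decode", "transform", "batch"]            # hypothetical stages
PER_WORKER_RATE = {"read": 900, "decode": 350, "transform": 500, "batch": 1200}
CPU_BUDGET = 16                                              # cores on the trainer machine


def throughput(alloc):
    """End-to-end samples/sec is limited by the slowest stage."""
    return min(PER_WORKER_RATE[s] * alloc[s] for s in STAGES)


def neighbors(alloc):
    """All allocations reachable by moving one core from one stage to another."""
    moves = []
    for src in STAGES:
        for dst in STAGES:
            if src != dst and alloc[src] > 1:
                new = dict(alloc)
                new[src] -= 1
                new[dst] += 1
                moves.append(new)
    return moves


def tune(steps=200, epsilon=0.2, seed=0):
    """Epsilon-greedy hill climbing over core allocations (a simple stand-in
    for the RL agent described in the abstract)."""
    random.seed(seed)
    alloc = {s: CPU_BUDGET // len(STAGES) for s in STAGES}   # start with an even split
    best = dict(alloc)
    for _ in range(steps):
        candidate = (random.choice(neighbors(alloc)) if random.random() < epsilon
                     else max(neighbors(alloc), key=throughput))
        if throughput(candidate) >= throughput(alloc):
            alloc = candidate
        if throughput(alloc) > throughput(best):
            best = dict(alloc)
    return best


if __name__ == "__main__":
    best = tune()
    print("allocation:", best, "-> samples/sec:", throughput(best))
```

In this toy setting the search shifts cores toward the "decode" stage, whose per-worker rate is lowest; the real system additionally has to cope with noisy, time-varying throughput measurements, which is where an RL formulation pays off.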
Pages: 430-442
Number of pages: 13