G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Cited: 0
Authors
Xiao, Youshao [1 ]
Zhao, Shangchun [1 ]
Zhou, Zhenglei [1 ]
Huan, Zhaoxin [1 ]
Ju, Lin [1 ]
Zhang, Xiaolu [1 ]
Wang, Lin [1 ]
Zhou, Jun [1 ]
Affiliations
[1] Ant Group, Hangzhou, China
Keywords
Recommender System; Deep Meta Learning; Distributed Training
DOI
10.1145/3583780.3615208
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM), significantly improving statistical performance, especially in cold-start scenarios. However, existing systems are not tailored to meta-learning-based DLRM models and suffer from critical efficiency problems in distributed training on GPU clusters, because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops that meta learning requires. This paper presents G-Meta, a high-performance framework for large-scale training of optimization-based meta DLRM models on GPU clusters. First, G-Meta combines data parallelism and model parallelism, carefully orchestrating computation and communication to enable high-speed distributed training. Second, it proposes a Meta-IO pipeline for efficient data ingestion that alleviates the I/O bottleneck. Extensive experiments show that G-Meta achieves notable training speedups without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening the continuous model delivery cycle by a factor of four. It also yields a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in Cost Per Mille (CPM) in Alipay's homepage display advertising, benefiting from larger training samples and more tasks.
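The "two task-specific datasets and two update loops" in the abstract refer to the support/query split and the inner/outer loops of optimization-based meta learning. G-Meta's implementation is not published here; as a hypothetical illustration only, a first-order MAML-style update on a plain linear model (all names and hyperparameters below are assumptions, not from the paper) might look like:

```python
import numpy as np

def inner_update(w, support_x, support_y, lr=0.1):
    # Inner loop: one gradient step on the task's support (train) set,
    # using the mean-squared-error gradient of a linear model.
    grad = 2 * support_x.T @ (support_x @ w - support_y) / len(support_y)
    return w - lr * grad

def meta_update(w, tasks, meta_lr=0.05):
    # Outer loop: adapt to each task on its support set, evaluate the
    # gradient on its query (test) set, and apply the averaged gradient
    # to the shared initialization (first-order MAML approximation).
    meta_grad = np.zeros_like(w)
    for support_x, support_y, query_x, query_y in tasks:
        w_task = inner_update(w, support_x, support_y)
        meta_grad += 2 * query_x.T @ (query_x @ w_task - query_y) / len(query_y)
    return w - meta_lr * meta_grad / len(tasks)
```

Note that each meta step touches two datasets per task and runs two nested gradient computations; this doubled data and compute path is exactly what a conventional single-loop training pipeline is not laid out for.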
Pages: 4365-4369
Page count: 5