G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Cited: 0
Authors
Xiao, Youshao [1 ]
Zhao, Shangchun [1 ]
Zhou, Zhenglei [1 ]
Huan, Zhaoxin [1 ]
Ju, Lin [1 ]
Zhang, Xiaolu [1 ]
Wang, Lin [1 ]
Zhou, Jun [1 ]
Affiliations
[1] Ant Group, Hangzhou, China
Keywords
Recommender System; Deep Meta Learning; Distributed Training
DOI
10.1145/3583780.3615208
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM), significantly improving statistical performance, especially in cold-start scenarios. However, existing systems are not tailored to meta-learning-based DLRM models and suffer from critical efficiency problems in distributed training on GPU clusters, because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops that meta learning requires. This paper presents G-Meta, a high-performance framework for large-scale training of optimization-based meta DLRM models on GPU clusters. First, G-Meta combines data parallelism and model parallelism, carefully orchestrating computation and communication to enable high-speed distributed training. Second, it proposes a Meta-IO pipeline for efficient data ingestion that alleviates the I/O bottleneck. Extensive experiments show that G-Meta achieves notable training speedups without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening the continuous model delivery cycle by a factor of four. It also yields a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in Cost Per Mille (CPM) in Alipay's homepage display advertising, benefiting from larger training samples and more tasks.
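The "two task-specific datasets and two update loops" in the abstract refer to the support/query split and the inner/outer loops of optimization-based meta learning. G-Meta's implementation is not published here; as a hypothetical illustration only, a first-order MAML-style update on a plain linear model (all names and hyperparameters below are assumptions, not from the paper) might look like:

```python
import numpy as np

def inner_update(w, support_x, support_y, lr=0.1):
    # Inner loop: one gradient step on the task's support (train) set,
    # using the mean-squared-error gradient of a linear model.
    grad = 2 * support_x.T @ (support_x @ w - support_y) / len(support_y)
    return w - lr * grad

def meta_update(w, tasks, meta_lr=0.05):
    # Outer loop: adapt to each task on its support set, evaluate the
    # gradient on its query (test) set, and apply the averaged gradient
    # to the shared initialization (first-order MAML approximation).
    meta_grad = np.zeros_like(w)
    for support_x, support_y, query_x, query_y in tasks:
        w_task = inner_update(w, support_x, support_y)
        meta_grad += 2 * query_x.T @ (query_x @ w_task - query_y) / len(query_y)
    return w - meta_lr * meta_grad / len(tasks)
```

Note that each meta step touches two datasets per task and runs two nested gradient computations; this doubled data and compute path is exactly what a conventional single-loop training pipeline is not laid out for.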
Pages: 4365-4369
Page count: 5