BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Cited by: 0

Authors:

Liu, Tianfeng ^{[1,3,4]}

Chen, Yangrui ^{[2,3]}

Li, Dan ^{[1,4]}

Wu, Chuan ^{[2]}

Zhu, Yibo ^{[3]}

He, Jun ^{[3]}

Peng, Yanghua ^{[3]}

Chen, Hongzheng ^{[3,5]}

Chen, Hongzhi ^{[3]}

Guo, Chuanxiong ^{[3]}

Affiliations:

[1] Tsinghua Univ, Beijing, Peoples R China

[2] Univ Hong Kong, Hong Kong, Peoples R China

[3] ByteDance, Beijing, Peoples R China

[4] Zhongguancun Lab, Beijing, Peoples R China

[5] Cornell Univ, Ithaca, NY USA

Source:

PROCEEDINGS OF THE 20TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, NSDI 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

SYSTEM;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training GNNs on large graphs with billions of nodes and edges using GPUs. The main bottlenecks lie in preparing data for GPUs: subgraph sampling and feature retrieval. This paper proposes BGL, a distributed GNN training system designed to address these bottlenecks with a few key ideas. First, we propose a dynamic cache engine to minimize feature-retrieval traffic. By co-designing the caching policy and the sampling order, we find a sweet spot of low overhead and a high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL outperforms existing GNN training systems by 1.9x on average.

Pages: 103-118

Page count: 16

50 items in total

[31] Efficient I/O and Storage of Adaptive-Resolution Data
Kumar, Sidharth
Edwards, John
Bremer, Peer-Timo
Knoll, Aaron
Christensen, Cameron
Vishwanath, Venkatram
Carns, Philip
Schmidt, John A.
Pascucci, Valerio
SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2014, : 413 - 423
[32] Efficient Data Restructuring and Aggregation for I/O Acceleration in PIDX
Kumar, Sidharth
Vishwanath, Venkatram
Carns, Philip
Levine, Joshua A.
Latham, Robert
Scorzelli, Giorgio
Kolla, Hemanth
Grout, Ray
Ross, Robert
Papka, Michael E.
Chen, Jacqueline
Pascucci, Valerio
2012 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2012,
[33] Efficient parallel I/O scheduling in the presence of data duplication
Liu, PF
Wang, DW
Wu, JJ
2003 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, PROCEEDINGS, 2003, : 231 - 238
[34] GraphMP: I/O-Efficient Big Graph Analytics on a Single Commodity Machine
Sun, Peng
Wen, Yonggang
Duong, Ta Nguyen Binh
Xiao, Xiaokui
IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (04) : 816 - 829
[35] Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing
Wang, Zhigang
Gu, Yu
Bao, Yubin
Yu, Ge
Yu, Jeffrey Xu
SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2016, : 479 - 494
[36] BG3: A Cost Effective and I/O Efficient Graph Database in ByteDance
Zhang, Wei
Chen, Cheng
Wang, Qiange
Wang, Wei
Yang, Shijiao
Zhou, Bingyu
Zhu, Huiming
Chen, Chao
Zhao, Yongjun
Hu, Yingqian
Cheng, Miaomiao
Li, Meng
Tan, Hongfei
Liu, Mengjin
Lin, Hexiang
Zhang, Shuai
Zhang, Lei
COMPANION OF THE 2024 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, SIGMOD-COMPANION 2024, 2024, : 360 - 372
[37] Parallel Data-Local Training for Optimizing Word2Vec Embeddings for Word and Graph Embeddings
Moon, Gordon E.
Newman-Griffis, Denis
Kim, Jinsung
Sukumaran-Rajam, Aravind
Fosler-Lussier, Eric
Sadayappan, P.
PROCEEDINGS OF 2019 5TH IEEE/ACM WORKSHOP ON MACHINE LEARNING IN HIGH PERFORMANCE COMPUTING ENVIRONMENTS (MLHPC 2019), 2019, : 44 - 55
[38] HUS-Graph: I/O-Efficient Out-of-Core Graph Processing with Hybrid Update Strategy
Xu, Xianghao
Wang, Fang
Jiang, Hong
Cheng, Yongli
Feng, Dan
Zhang, Yongxuan
PROCEEDINGS OF THE 47TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, 2018,
[39] Optimizing Network I/O Virtualization with Efficient Interrupt Coalescing and Virtual Receive Side Scaling
Dong, Yaozu
Xu, Dongxiao
Zhang, Yang
Liao, Guangdeng
2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011, : 26 - 34
[40] Optimizing the Query Performance of Block Index Through Data Analysis and I/O Modeling
Wu, Tzuhsien
Chou, Jerry
Hao, Shyng
Dong, Bin
Klasky, Scott
Wu, Kesheng
SC'17: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2017,
