BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Cited: 0
Authors
Liu, Tianfeng [1 ,3 ,4 ]
Chen, Yangrui [2 ,3 ]
Li, Dan [1 ,4 ]
Wu, Chuan [2 ]
Zhu, Yibo [3 ]
He, Jun [3 ]
Peng, Yanghua [3 ]
Chen, Hongzheng [3 ,5 ]
Chen, Hongzhi [3 ]
Guo, Chuanxiong [3 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Univ Hong Kong, Hong Kong, Peoples R China
[3] ByteDance, Beijing, Peoples R China
[4] Zhongguancun Lab, Beijing, Peoples R China
[5] Cornell Univ, Ithaca, NY USA
Funding
National Natural Science Foundation of China;
Keywords
SYSTEM;
DOI
N/A
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202 ;
Abstract
Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training on large graphs with billions of nodes and edges using GPUs. The main bottleneck is the process of preparing data for GPUs: subgraph sampling and feature retrieval. This paper proposes BGL, a distributed GNN training system designed to address this bottleneck with a few key ideas. First, we propose a dynamic cache engine to minimize feature-retrieval traffic. By co-designing the caching policy and the order of sampling, we find a sweet spot of low overhead and high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average.
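To illustrate the caching idea in the abstract: a small feature cache absorbs remote-feature traffic only if consecutive mini-batches touch overlapping node sets, which is why BGL co-designs the sampling order with the cache policy. The sketch below is not BGL's actual cache engine; it is a minimal, hypothetical LRU feature cache (`FeatureCache` and `fetch_remote` are illustrative names) showing how hit ratio is measured against a pluggable remote-fetch function.

```python
from collections import OrderedDict

class FeatureCache:
    """Toy LRU cache for node feature vectors (illustrative only).

    A locality-aware sampling order (e.g. BFS-like traversal) makes
    consecutive mini-batches share neighbours, so even a small cache
    like this one absorbs much of the feature-retrieval traffic.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # node_id -> feature, in recency order
        self.hits = 0
        self.misses = 0

    def get(self, node_id, fetch_remote):
        """Return the feature for node_id, fetching remotely on a miss."""
        if node_id in self.store:
            self.hits += 1
            self.store.move_to_end(node_id)  # mark as most recently used
            return self.store[node_id]
        self.misses += 1
        feat = fetch_remote(node_id)
        self.store[node_id] = feat
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used
        return feat

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A locality-preserving sampling order raises the hit ratio of any such policy, while a purely random node order defeats it; BGL's contribution is finding a policy/order combination whose bookkeeping overhead stays low.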
Pages: 103-118
Page count: 16
Related papers
50 records in total
  • [21] AsyncStripe: I/O Efficient Asynchronous Graph Computing on a Single Server
    Cheng, Shuhan
    Zhang, Guangyan
    Shu, Jiwu
    Zheng, Weimin
    2016 INTERNATIONAL CONFERENCE ON HARDWARE/SOFTWARE CODESIGN AND SYSTEM SYNTHESIS (CODES+ISSS), 2016,
  • [22] GraphCP: An I/O-Efficient Concurrent Graph Processing Framework
    Xu, Xianghao
    Wang, Fang
    Jiang, Hong
    Cheng, Yongli
    Feng, Dan
    Zhang, Yongxuan
    Fang, Peng
    2021 IEEE/ACM 29TH INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2021,
  • [23] I/O Efficient Core Graph Decomposition: Application to Degeneracy Ordering
    Wen, Dong
    Qin, Lu
    Zhang, Ying
    Lin, Xuemin
    Yu, Jeffrey Xu
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2019, 31 (01) : 75 - 90
  • [24] Efficient data-movement for lightweight I/O
    Oldfield, Ron A.
    Maccabe, Arthur B.
    Widener, Patrick
    Ward, Lee
    Kordenbrock, Todd
    2006 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, VOLS 1 AND 2, 2006, : 558 - +
  • [25] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
    Wei, Jia
    Zhang, Xingjun
    Wang, Longxiang
    Wei, Zheng
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2023, 20 (04)
  • [26] I/O-efficient GPU-based acceleration of coherent dedispersion for pulsar observation
    Kong, Xiangcong
    Zheng, Xiaoying
    Zhu, Yongxin
    Duan, Gaoxiang
    Chen, Zikang
    JOURNAL OF SYSTEMS ARCHITECTURE, 2023, 142
  • [27] EDC: An Elastic Data Cache to Optimizing the I/O Performance in Deduplicated SSDs
    Lu, Mengting
    Wang, Fang
    Li, Zongwei
    He, Wenpeng
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2022, 41 (07) : 2250 - 2262
  • [28] Succinct and I/O Efficient Data Structures for Traversal in Trees
    Dillabaugh, Craig
    He, Meng
    Maheshwari, Anil
    ALGORITHMS AND COMPUTATION, PROCEEDINGS, 2008, 5369 : 112 - 123
  • [29] Succinct and I/O Efficient Data Structures for Traversal in Trees
    Dillabaugh, Craig
    He, Meng
    Maheshwari, Anil
    ALGORITHMICA, 2012, 63 (1-2) : 201 - 223