BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Cited: 0
Authors
Liu, Tianfeng [1 ,3 ,4 ]
Chen, Yangrui [2 ,3 ]
Li, Dan [1 ,4 ]
Wu, Chuan [2 ]
Zhu, Yibo [3 ]
He, Jun [3 ]
Peng, Yanghua [3 ]
Chen, Hongzheng [3 ,5 ]
Chen, Hongzhi [3 ]
Guo, Chuanxiong [3 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Univ Hong Kong, Hong Kong, Peoples R China
[3] ByteDance, Beijing, Peoples R China
[4] Zhongguancun Lab, Beijing, Peoples R China
[5] Cornell Univ, Ithaca, NY USA
Funding
National Natural Science Foundation of China;
Keywords
SYSTEM;
DOI
N/A
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202 ;
Abstract
Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training on large graphs with billions of nodes and edges using GPUs. The main bottleneck is the process of preparing data for GPUs: subgraph sampling and feature retrieval. This paper proposes BGL, a distributed GNN training system designed to address this bottleneck with a few key ideas. First, we propose a dynamic cache engine to minimize feature-retrieval traffic. By co-designing the caching policy and the order of sampling, we find a sweet spot of low overhead and high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems, by 1.9x on average.
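To illustrate the caching idea in the abstract: a small feature cache absorbs remote-feature traffic only if consecutive mini-batches touch overlapping node sets, which is why BGL co-designs the sampling order with the cache policy. The sketch below is not BGL's actual cache engine; it is a minimal, hypothetical LRU feature cache (`FeatureCache` and `fetch_remote` are illustrative names) showing how hit ratio is measured against a pluggable remote-fetch function.

```python
from collections import OrderedDict

class FeatureCache:
    """Toy LRU cache for node feature vectors (illustrative only).

    A locality-aware sampling order (e.g. BFS-like traversal) makes
    consecutive mini-batches share neighbours, so even a small cache
    like this one absorbs much of the feature-retrieval traffic.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # node_id -> feature, in recency order
        self.hits = 0
        self.misses = 0

    def get(self, node_id, fetch_remote):
        """Return the feature for node_id, fetching remotely on a miss."""
        if node_id in self.store:
            self.hits += 1
            self.store.move_to_end(node_id)  # mark as most recently used
            return self.store[node_id]
        self.misses += 1
        feat = fetch_remote(node_id)
        self.store[node_id] = feat
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used
        return feat

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A locality-preserving sampling order raises the hit ratio of any such policy, while a purely random node order defeats it; BGL's contribution is finding a policy/order combination whose bookkeeping overhead stays low.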
Pages: 103-118
Page count: 16
Related papers
50 records in total
  • [21] AsyncStripe: I/O Efficient Asynchronous Graph Computing on a Single Server
    Cheng, Shuhan
    Zhang, Guangyan
    Shu, Jiwu
    Zheng, Weimin
    2016 INTERNATIONAL CONFERENCE ON HARDWARE/SOFTWARE CODESIGN AND SYSTEM SYNTHESIS (CODES+ISSS), 2016,
  • [22] GraphCP: An I/O-Efficient Concurrent Graph Processing Framework
    Xu, Xianghao
    Wang, Fang
    Jiang, Hong
    Cheng, Yongli
    Feng, Dan
    Zhang, Yongxuan
    Fang, Peng
    2021 IEEE/ACM 29TH INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2021,
  • [23] I/O Efficient Core Graph Decomposition: Application to Degeneracy Ordering
    Wen, Dong
    Qin, Lu
    Zhang, Ying
    Lin, Xuemin
    Yu, Jeffrey Xu
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2019, 31 (01) : 75 - 90
  • [24] Efficient data-movement for lightweight I/O
    Oldfield, Ron A.
    Maccabe, Arthur B.
    Widener, Patrick
    Ward, Lee
    Kordenbrock, Todd
    2006 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, VOLS 1 AND 2, 2006, : 558 - +
  • [25] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
    Wei, Jia
    Zhang, Xingjun
    Wang, Longxiang
    Wei, Zheng
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2023, 20 (04)
  • [26] I/O-efficient GPU-based acceleration of coherent dedispersion for pulsar observation
    Kong, Xiangcong
    Zheng, Xiaoying
    Zhu, Yongxin
    Duan, Gaoxiang
    Chen, Zikang
    JOURNAL OF SYSTEMS ARCHITECTURE, 2023, 142
  • [27] EDC: An Elastic Data Cache to Optimizing the I/O Performance in Deduplicated SSDs
    Lu, Mengting
    Wang, Fang
    Li, Zongwei
    He, Wenpeng
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2022, 41 (07) : 2250 - 2262
  • [28] Succinct and I/O Efficient Data Structures for Traversal in Trees
    Dillabaugh, Craig
    He, Meng
    Maheshwari, Anil
    ALGORITHMS AND COMPUTATION, PROCEEDINGS, 2008, 5369 : 112 - 123
  • [29] Succinct and I/O Efficient Data Structures for Traversal in Trees
    Dillabaugh, Craig
    He, Meng
    Maheshwari, Anil
    ALGORITHMICA, 2012, 63 (1-2) : 201 - 223