Efficient Data Loader for Fast Sampling-Based GNN Training on Large Graphs

Cited by: 15
Authors
Bai, Youhui [1]
Li, Cheng [1]
Lin, Zhiqi [1]
Wu, Yufei [1]
Miao, Youshan [2]
Liu, Yunxin [2]
Xu, Yinlong [1,3]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230026, Anhui, Peoples R China
[2] Microsoft Res, Beijing 100080, Peoples R China
[3] Anhui Prov Key Lab High Performance Comp, Hefei 230026, Anhui, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Training; Graphics processing units; Loading; Computational modeling; Load modeling; Partitioning algorithms; Deep learning; Graph neural network; cache; large graph; graph partition; pipeline; multi-GPU;
DOI
10.1109/TPDS.2021.3065737
Chinese Library Classification
TP301 [Theory and Methods];
Discipline classification code
081202;
Abstract
Emerging graph neural networks (GNNs) have extended the successes of deep learning techniques from datasets such as images and text to more complex graph-structured data. By leveraging GPU accelerators, existing frameworks combine mini-batch and sampling for effective and efficient model training on large graphs. However, this setup faces a scalability issue, since loading rich vertex features from CPU to GPU through a limited-bandwidth link usually dominates the training cycle. In this article, we propose PaGraph, a novel, efficient data loader that supports general and efficient sampling-based GNN training on a single multi-GPU server. PaGraph significantly reduces data loading time by exploiting available GPU resources to keep frequently accessed graph data in a cache. It also embodies a lightweight yet effective caching policy that simultaneously takes into account graph structural information and the data access patterns of sampling-based GNN training. Furthermore, to scale out on multiple GPUs, PaGraph develops a fast GNN-computation-aware partition algorithm that avoids cross-partition access during data-parallel training and achieves better cache efficiency. Finally, it overlaps data loading and GNN computation to further hide loading costs. Evaluations on two representative GNN models, GCN and GraphSAGE, using two sampling methods, Neighbor and Layer-wise, show that PaGraph can eliminate the data loading time from the GNN training pipeline and achieve up to 4.8x performance speedup over state-of-the-art baselines. Together with preprocessing optimization, PaGraph further delivers up to 16.0x end-to-end speedup.
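To make the caching idea in the abstract concrete, below is a minimal, hypothetical sketch of a static GPU-side feature cache keyed on vertex out-degree: vertices with the highest out-degree are assumed to be sampled most often, so their features are pre-loaded into (here simulated) GPU memory, and mini-batch lookups fall back to CPU memory on a miss. The class name `StaticFeatureCache` and all method names are illustrative assumptions, not PaGraph's actual API.

```python
import numpy as np

class StaticFeatureCache:
    """Degree-prioritized static cache: keep the features of the
    highest-out-degree vertices resident so most sampled mini-batch
    lookups avoid a CPU-to-GPU copy (simulated here with plain arrays)."""

    def __init__(self, features, out_degrees, capacity):
        # Cache the `capacity` vertices with the largest out-degree.
        order = np.argsort(-out_degrees)
        self.cached_ids = set(order[:capacity].tolist())
        self.gpu_store = {v: features[v].copy() for v in self.cached_ids}
        self.cpu_features = features  # fallback storage for misses
        self.hits = 0
        self.misses = 0

    def gather(self, batch_ids):
        """Return the feature matrix for a sampled mini-batch,
        counting cache hits and misses along the way."""
        rows = []
        for v in batch_ids:
            if v in self.gpu_store:
                self.hits += 1
                rows.append(self.gpu_store[v])
            else:
                self.misses += 1  # would trigger a CPU->GPU transfer
                rows.append(self.cpu_features[v])
        return np.stack(rows)
```

In this sketch the cache is filled once before training and never evicted, mirroring the lightweight, static policy the abstract describes; a real implementation would place `gpu_store` in spare GPU memory and batch the miss transfers.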
Pages: 2541-2556
Page count: 16