Deep learning based data prefetching in CPU-GPU unified virtual memory

Cited by: 5
Authors
Long, Xinjian [1 ,2 ]
Gong, Xiangyang [1 ,2 ]
Zhang, Bo [1 ,2 ]
Zhou, Huiyang [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Comp Sci, Natl Pilot Software Engn Sch, Beijing 100876, Peoples R China
[3] North Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
National Natural Science Foundation of China
Keywords
Data prefetching; Graphics processing unit; Unified virtual memory; Deep learning; Transformer
DOI
10.1016/j.jpdc.2022.12.004
CLC Number
TP301 [Theory, Methods]
Subject Classification Code
081202
Abstract
Unified Virtual Memory (UVM) relieves developers of the burden of maintaining complex data structures and performing explicit data migration by enabling on-demand data movement between CPU memory and GPU memory. However, on-demand paging quickly becomes a performance bottleneck for UVM because of the high latency of page table walks and of data migration over the interconnect. Prefetching is considered a promising solution to this problem given its ability to exploit the locality of program memory access patterns. However, existing locality-based prefetching schemes cannot handle all situations. An ideal prefetcher should not only examine narrow regions of the requested address space but also capture global context to deliver a good prediction of the memory access pattern. This paper proposes a novel deep-learning-based framework for page prefetching in UVM. We first show that a powerful Transformer model can provide high accuracy for UVM page prefetching. We then analyze and interpret this Transformer model, deriving several insights that allow us to design a simpler model that matches the unconstrained model's accuracy at orders-of-magnitude lower cost. We use a pattern-based method to make the UVM page predictor generalize across different GPU workloads. We evaluate this framework on a set of 11 memory-intensive benchmarks from popular benchmark suites. Our solution outperforms the state-of-the-art (SOTA) UVM framework, improving performance by 10.89%, improving the device memory page hit rate by 16.98% (89.02% vs. 76.10% for the prior art), and reducing CPU-GPU interconnect traffic by 11.05%. According to our proposed unified metric, which combines accuracy, coverage, and page hit rate, our solution comes closer to the ideal prefetching scheme than the SOTA design (0.90 vs. 0.85, where a perfect prefetcher scores 1.0). (c) 2022 Elsevier Inc. All rights reserved.
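The paper's code is not reproduced in this record. As a rough, illustrative sketch of the kind of model the abstract describes, the Python snippet below shows a minimal Transformer that predicts the next page-fault delta (bucketed into a small vocabulary) from a history of recent deltas. The class name, vocabulary size, sequence length, and every hyperparameter here are assumptions made for exposition, not values from the paper.

# Illustrative sketch only (not the authors' code): the delta bucketing,
# shapes, and all hyperparameters below are assumptions for exposition.
import torch
import torch.nn as nn

class PagePrefetchPredictor(nn.Module):
    """Toy Transformer that maps a history of bucketed page-fault deltas
    to a distribution over the next delta bucket."""

    def __init__(self, vocab_size: int = 256, d_model: int = 64,
                 nhead: int = 4, num_layers: int = 2, seq_len: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned positional encoding for a fixed-length fault history.
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)  # next-delta logits

    def forward(self, deltas: torch.Tensor) -> torch.Tensor:
        # deltas: (batch, seq_len) integer ids of past page-offset deltas
        x = self.embed(deltas) + self.pos[: deltas.size(1)]
        h = self.encoder(x)          # (batch, seq_len, d_model)
        return self.head(h[:, -1])   # predict from the last position

# Usage: predict the next delta bucket from a 32-fault history.
model = PagePrefetchPredictor()
history = torch.randint(0, 256, (1, 32))
next_bucket = model(history).argmax(dim=-1)

In a real prefetcher along these lines, the predicted delta bucket would be mapped back to a page offset and the corresponding pages migrated to device memory ahead of demand faults.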
Pages: 19-31 (13 pages)
Related Papers (50 in total)
  • [1] An Adaptive Framework for Oversubscription Management in CPU-GPU Unified Memory
    Ganguly, Debashis
    Melhem, Rami
    Yang, Jun
    PROCEEDINGS OF THE 2021 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2021), 2021: 1212-1217
  • [2] An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory
    Long, Xinjian
    Gong, Xiangyang
    Zhang, Bo
    Zhou, Huiyang
    JOURNAL OF GRID COMPUTING, 2023, 21 (01)
  • [3] Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
    Ganguly, Debashis
    Zhang, Ziyu
    Yang, Jun
    Melhem, Rami
    PROCEEDINGS OF THE 2019 46TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA '19), 2019: 224-235
  • [4] A Unified CPU-GPU Protocol for GNN Training
    Lin, Yi-Chien
    Deng, Gangda
    Prasanna, Viktor
    PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2024, CF 2024, 2024: 155-163
  • [5] A collaborative CPU-GPU approach for deep learning on mobile devices
    Valery, Olivier
    Liu, Pangfeng
    Wu, Jan-Jan
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (17)
  • [6] Raptor: Mitigating CPU-GPU False Sharing Under Unified Memory Systems
    Rafi, Md Erfanul Haque
    Williams, Kaylee
    Qasem, Apan
    2022 IEEE 13TH INTERNATIONAL GREEN AND SUSTAINABLE COMPUTING CONFERENCE (IGSC), 2022: 41-48
  • [7] A unified schedule policy of distributed machine learning framework for CPU-GPU cluster
    Zhu, Ziyu
    Tang, Xiaochun
    Zhao, Quan
    Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University, 2021, 39 (03): 529-538
  • [8] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
    Kim, Taeyoon
    Park, ChanHo
    Mukimbekov, Mansur
    Hong, Heelim
    Kim, Minseok
    Jin, Ze
    Kim, Changdae
    Shin, Ji-Yong
    Jeon, Myeongjae
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 17 (04): 863-876
  • [9] Demystifying the TensorFlow Eager Execution of Deep Learning Inference on a CPU-GPU Tandem
    Delestrac, Paul
    Torres, Lionel
    Novo, David
    2022 25TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN (DSD), 2022: 446-455