Deep learning based data prefetching in CPU-GPU unified virtual memory

Cited by: 5
Authors
Long, Xinjian [1 ,2 ]
Gong, Xiangyang [1 ,2 ]
Zhang, Bo [1 ,2 ]
Zhou, Huiyang [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Comp Sci, Natl Pilot Software Engn Sch, Beijing 100876, Peoples R China
[3] North Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
National Natural Science Foundation of China
Keywords
Data prefetching; Graphics processing unit; Unified virtual memory; Deep learning; Transformer
DOI
10.1016/j.jpdc.2022.12.004
CLC Number
TP301 [Theory, Methods]
Subject Classification Code
081202
Abstract
Unified Virtual Memory (UVM) relieves developers of the burden of maintaining complex data structures and performing explicit data migration by enabling on-demand data movement between CPU memory and GPU memory. However, on-demand paging quickly becomes a performance bottleneck for UVM because of the high latency of page table walks and of data migration over the interconnect. Prefetching is considered a promising solution to this problem given its ability to exploit the locality of program memory access patterns. However, existing locality-based prefetching schemes cannot handle all situations. An ideal prefetcher should not only examine narrow regions of the requested address space but also capture global context to deliver a good prediction of the memory access pattern. This paper proposes a novel deep-learning-based framework for page prefetching in UVM. We first show that a powerful Transformer model can provide high accuracy for UVM page prefetching. We then analyze and interpret this Transformer model, deriving several insights that allow us to design a simpler model that matches the unconstrained model's accuracy at orders-of-magnitude lower cost. We use a pattern-based method to make the UVM page predictor generalize across different GPU workloads. We evaluate this framework on a set of 11 memory-intensive benchmarks from popular benchmark suites. Our solution outperforms the state-of-the-art (SOTA) UVM framework, improving performance by 10.89%, improving the device memory page hit rate by 16.98% (89.02% vs. 76.10% for the prior art), and reducing CPU-GPU interconnect traffic by 11.05%. According to our proposed unified metric, which combines accuracy, coverage, and page hit rate, our solution comes closer to the ideal prefetching scheme than the SOTA design (0.90 vs. 0.85, where a perfect prefetcher scores 1.0). (c) 2022 Elsevier Inc. All rights reserved.
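The paper's code is not reproduced in this record. As a rough, illustrative sketch of the kind of model the abstract describes, the Python snippet below shows a minimal Transformer that predicts the next page-fault delta (bucketed into a small vocabulary) from a history of recent deltas. The class name, vocabulary size, sequence length, and every hyperparameter here are assumptions made for exposition, not values from the paper.

# Illustrative sketch only (not the authors' code): the delta bucketing,
# shapes, and all hyperparameters below are assumptions for exposition.
import torch
import torch.nn as nn

class PagePrefetchPredictor(nn.Module):
    """Toy Transformer that maps a history of bucketed page-fault deltas
    to a distribution over the next delta bucket."""

    def __init__(self, vocab_size: int = 256, d_model: int = 64,
                 nhead: int = 4, num_layers: int = 2, seq_len: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned positional encoding for a fixed-length fault history.
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)  # next-delta logits

    def forward(self, deltas: torch.Tensor) -> torch.Tensor:
        # deltas: (batch, seq_len) integer ids of past page-offset deltas
        x = self.embed(deltas) + self.pos[: deltas.size(1)]
        h = self.encoder(x)          # (batch, seq_len, d_model)
        return self.head(h[:, -1])   # predict from the last position

# Usage: predict the next delta bucket from a 32-fault history.
model = PagePrefetchPredictor()
history = torch.randint(0, 256, (1, 32))
next_bucket = model(history).argmax(dim=-1)

In a real prefetcher along these lines, the predicted delta bucket would be mapped back to a page offset and the corresponding pages migrated to device memory ahead of demand faults.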
Pages: 19-31 (13 pages)
Related Papers (50 in total)
  • [1] An Adaptive Framework for Oversubscription Management in CPU-GPU Unified Memory
    Ganguly, Debashis
    Melhem, Rami
    Yang, Jun
    PROCEEDINGS OF THE 2021 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2021), 2021: 1212-1217
  • [2] An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory
    Long, Xinjian
    Gong, Xiangyang
    Zhang, Bo
    Zhou, Huiyang
    JOURNAL OF GRID COMPUTING, 2023, 21 (01)
  • [3] Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory
    Ganguly, Debashis
    Zhang, Ziyu
    Yang, Jun
    Melhem, Rami
    PROCEEDINGS OF THE 2019 46TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA '19), 2019: 224-235
  • [4] A Unified CPU-GPU Protocol for GNN Training
    Lin, Yi-Chien
    Deng, Gangda
    Prasanna, Viktor
    PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS 2024, CF 2024, 2024: 155-163
  • [5] A collaborative CPU-GPU approach for deep learning on mobile devices
    Valery, Olivier
    Liu, Pangfeng
    Wu, Jan-Jan
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (17)
  • [6] Raptor: Mitigating CPU-GPU False Sharing Under Unified Memory Systems
    Rafi, Md Erfanul Haque
    Williams, Kaylee
    Qasem, Apan
    2022 IEEE 13TH INTERNATIONAL GREEN AND SUSTAINABLE COMPUTING CONFERENCE (IGSC), 2022: 41-48
  • [7] A unified schedule policy of distributed machine learning framework for CPU-GPU cluster
    Zhu, Ziyu
    Tang, Xiaochun
    Zhao, Quan
    Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University, 2021, 39 (03): 529-538
  • [8] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
    Kim, Taeyoon
    Park, ChanHo
    Mukimbekov, Mansur
    Hong, Heelim
    Kim, Minseok
    Jin, Ze
    Kim, Changdae
    Shin, Ji-Yong
    Jeon, Myeongjae
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 17 (04): 863-876
  • [9] Demystifying the TensorFlow Eager Execution of Deep Learning Inference on a CPU-GPU Tandem
    Delestrac, Paul
    Torres, Lionel
    Novo, David
    2022 25TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN (DSD), 2022: 446-455