Co-Scheduling on Fused CPU-GPU Architectures With Shared Last Level Caches

Cited by: 7
Authors
Damschen, Marvin [1 ]
Mueller, Frank [2 ]
Henkel, Joerg [1 ]
Affiliations
[1] Karlsruhe Inst Technol, Chair Embedded Syst, D-76131 Karlsruhe, Germany
[2] North Carolina State Univ, Dept Comp Sci, Raleigh, NC 27695 USA
Funding
U.S. National Science Foundation;
Keywords
Heterogeneous computing; integrated architecture; performance tuning; scheduling;
DOI
10.1109/TCAD.2018.2857042
CLC Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Fused CPU-GPU architectures integrate a CPU and a general-purpose GPU on a single die. Recent fused architectures even share the last level cache (LLC) between CPU and GPU, which enables hardware-supported byte-level coherency. Thus, CPU and GPU can execute computational kernels collaboratively, but novel methods to co-schedule work are required. This paper contributes three dynamic co-scheduling methods. Two of our methods implement workers that autonomously acquire work from a common set of independent work items (similar to bag-of-tasks scheduling). The third method, host-side profiling, uses a fraction of the total work of a kernel to determine, based on profiling, a ratio of how to distribute work between CPU and GPU. The resulting ratio is used for the following executions of the same kernel. Our methods are realized using OpenCL 2.0, which introduces fine-grained shared virtual memory (SVM) to allocate coherent memory between CPU and GPU. We port the Rodinia Benchmark Suite, a standard suite for heterogeneous computing, to fine-grained SVM and fused CPU-GPU architectures (Rodinia-SVM). We evaluate the overhead of fine-grained SVM and analyze the suitability of OpenCL 2.0's new features for co-scheduling. Our host-side profiling method performs competitively with the optimal choice of executing kernels either on CPU or GPU (hypothetical xor-Oracle). On average, it achieves 97% of xor-Oracle's performance and a 1.43x speedup over using the GPU alone (the Rodinia default). We show, however, that in most cases splitting the work of a kernel between CPU and GPU is not beneficial compared to running it exclusively on the most suitable single compute device. For a fixed amount of work per device, cache-related stalls can increase by up to 1.75x when both devices are used in parallel instead of exclusively, even though cache misses remain the same. Thus, inefficient cache coherence, rather than the cost of cache conflicts, is the major performance bottleneck on current Intel fused CPU-GPU architectures with a shared LLC.
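The two co-scheduling ideas summarized in the abstract can be sketched in plain Python (a simplified simulation, not the paper's OpenCL 2.0 implementation; threads stand in for the CPU and GPU devices, and all names, chunk sizes, and per-item timings here are illustrative assumptions):

```python
import itertools
import threading

# --- Worker-based co-scheduling (bag-of-tasks style) ---------------------
# Each "device" runs a worker that autonomously grabs the next chunk of
# work-item indices from a shared counter until all items are consumed.
# (The paper does this across CPU and GPU via fine-grained SVM; plain
# threads stand in for devices in this sketch.)
def self_scheduling_workers(items, process, n_workers=2, chunk=4):
    counter = itertools.count(0)  # next(counter) is atomic in CPython
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            start = next(counter) * chunk
            if start >= len(items):
                return
            for i in range(start, min(start + chunk, len(items))):
                r = process(items[i])
                with lock:
                    results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# --- Host-side profiling -------------------------------------------------
# Run a small fraction of a kernel's work on each device, derive a split
# ratio from the measured throughputs, and reuse that ratio for later
# executions of the same kernel. Per-item costs are simulated inputs.
PROFILE_FRACTION = 0.1  # fraction of total work spent on profiling

def profile_split_ratio(total_items, time_per_item_cpu, time_per_item_gpu):
    """Return the fraction of work to assign to the CPU."""
    sample = max(1, int(total_items * PROFILE_FRACTION) // 2)
    cpu_rate = sample / (sample * time_per_item_cpu)  # items per second
    gpu_rate = sample / (sample * time_per_item_gpu)
    return cpu_rate / (cpu_rate + gpu_rate)

def split_work(total_items, cpu_ratio):
    """Partition work items so both devices should finish together."""
    cpu_items = int(total_items * cpu_ratio)
    return cpu_items, total_items - cpu_items
```

For example, if the GPU processes items four times faster per item, `profile_split_ratio(10000, 4e-6, 1e-6)` yields a CPU share of 0.2, and `split_work` then assigns 2,000 items to the CPU and 8,000 to the GPU.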
Pages: 2337-2347
Page count: 11
Related Papers
50 items in total
  • [1] CPU-Assisted GPGPU on Fused CPU-GPU Architectures
    Yang, Yi
    Xiang, Ping
    Mantor, Mike
    Zhou, Huiyang
    [J]. 2012 IEEE 18TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA), 2012, : 103 - 114
  • [2] WCET Analysis of the Shared Data Cache in Integrated CPU-GPU Architectures
    Huangfu, Yijie
    Zhang, Wei
    [J]. 2017 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2017,
  • [3] Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning
    Saba, Issa
    Arima, Eishi
    Liu, Dai
    Schulz, Martin
    [J]. ARCHITECTURE OF COMPUTING SYSTEMS, ARCS 2022, 2022, 13642 : 51 - 67
  • [4] Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence
    Agarwal, Neha
    Nellans, David
    Ebrahimi, Eiman
    Wenisch, Thomas F.
    Danskin, John
    Keckler, Stephen W.
    [J]. PROCEEDINGS OF THE 2016 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA-22), 2016, : 494 - 506
  • [5] SRAM- and STT-RAM-based hybrid, shared last-level cache for on-chip CPU-GPU heterogeneous architectures
    Gao, Lan
    Wang, Rui
    Xu, Yunlong
    Yang, Hailong
    Luan, Zhongzhi
    Qian, Depei
    Zhang, Han
    Cai, Jihong
    [J]. JOURNAL OF SUPERCOMPUTING, 2018, 74 (07): : 3388 - 3414
  • [6] CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU-GPU system
    Zhang, Qi
    Liu, Yi
    Liu, Tao
    Qian, Depei
    [J]. JOURNAL OF SUPERCOMPUTING, 2023, 79 (13): : 14172 - 14199
  • [7] Denial of Service in CPU-GPU Heterogeneous Architectures
    Wen, Hao
    Zhang, Wei
    [J]. 2020 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2020,
  • [8] Hardware Support for Concurrent Detection of Multiple Concurrency Bugs on Fused CPU-GPU Architectures
    Zhang, Weihua
    Yu, Shiqiang
    Wang, Haojun
    Dai, Zhuofang
    Chen, Haibo
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2016, 65 (10) : 3083 - 3095
  • [9] ONLINE SCHEDULING OF MIXED CPU-GPU JOBS
    Chen, Lin
    Ye, Deshi
    Zhang, Guochuang
    [J]. INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2014, 25 (06) : 745 - 761
  • [10] HybridHadoop: CPU-GPU Hybrid Scheduling in Hadoop
    Oh, Chanyoung
    Jung, Hyeonjin
    Yi, Saehanseul
    Yoon, Illo
    Yi, Youngmin
    [J]. PROCEEDINGS OF INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION (HPC ASIA 2021), 2020, : 40 - 49