Co-Scheduling on Fused CPU-GPU Architectures With Shared Last Level Caches

Cited by: 7
Authors
Damschen, Marvin [1 ]
Mueller, Frank [2 ]
Henkel, Joerg [1 ]
Affiliations
[1] Karlsruhe Inst Technol, Chair Embedded Syst, D-76131 Karlsruhe, Germany
[2] North Carolina State Univ, Dept Comp Sci, Raleigh, NC 27695 USA
Funding
U.S. National Science Foundation;
Keywords
Heterogeneous computing; integrated architecture; performance tuning; scheduling;
DOI
10.1109/TCAD.2018.2857042
CLC Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Fused CPU-GPU architectures integrate a CPU and a general-purpose GPU on a single die. Recent fused architectures even share the last level cache (LLC) between CPU and GPU, which enables hardware-supported byte-level coherency. Thus, CPU and GPU can execute computational kernels collaboratively, but novel methods to co-schedule work are required. This paper contributes three dynamic co-scheduling methods. Two of our methods implement workers that autonomously acquire work from a common set of independent work items (similar to bag-of-tasks scheduling). The third method, host-side profiling, uses a fraction of the total work of a kernel to determine, based on profiling, a ratio of how to distribute work between CPU and GPU. The resulting ratio is used for the following executions of the same kernel. Our methods are realized using OpenCL 2.0, which introduces fine-grained shared virtual memory (SVM) to allocate coherent memory between CPU and GPU. We port the Rodinia Benchmark Suite, a standard suite for heterogeneous computing, to fine-grained SVM and fused CPU-GPU architectures (Rodinia-SVM). We evaluate the overhead of fine-grained SVM and analyze the suitability of OpenCL 2.0's new features for co-scheduling. Our host-side profiling method performs competitively with the optimal choice of executing kernels either on CPU or GPU (hypothetical xor-Oracle). On average, it achieves 97% of xor-Oracle's performance and a 1.43x speedup over using the GPU alone (the Rodinia default). We show, however, that in most cases splitting the work of a kernel between CPU and GPU is not beneficial compared to running it exclusively on the most suitable single compute device. For a fixed amount of work per device, cache-related stalls can increase by up to 1.75x when both devices are used in parallel instead of exclusively, even though cache misses remain the same. Thus, inefficient cache coherence, rather than the cost of cache conflicts, is the major performance bottleneck on current Intel fused CPU-GPU architectures with a shared LLC.
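The two co-scheduling ideas summarized in the abstract can be sketched in plain Python (a simplified simulation, not the paper's OpenCL 2.0 implementation; threads stand in for the CPU and GPU devices, and all names, chunk sizes, and per-item timings here are illustrative assumptions):

```python
import itertools
import threading

# --- Worker-based co-scheduling (bag-of-tasks style) ---------------------
# Each "device" runs a worker that autonomously grabs the next chunk of
# work-item indices from a shared counter until all items are consumed.
# (The paper does this across CPU and GPU via fine-grained SVM; plain
# threads stand in for devices in this sketch.)
def self_scheduling_workers(items, process, n_workers=2, chunk=4):
    counter = itertools.count(0)  # next(counter) is atomic in CPython
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            start = next(counter) * chunk
            if start >= len(items):
                return
            for i in range(start, min(start + chunk, len(items))):
                r = process(items[i])
                with lock:
                    results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# --- Host-side profiling -------------------------------------------------
# Run a small fraction of a kernel's work on each device, derive a split
# ratio from the measured throughputs, and reuse that ratio for later
# executions of the same kernel. Per-item costs are simulated inputs.
PROFILE_FRACTION = 0.1  # fraction of total work spent on profiling

def profile_split_ratio(total_items, time_per_item_cpu, time_per_item_gpu):
    """Return the fraction of work to assign to the CPU."""
    sample = max(1, int(total_items * PROFILE_FRACTION) // 2)
    cpu_rate = sample / (sample * time_per_item_cpu)  # items per second
    gpu_rate = sample / (sample * time_per_item_gpu)
    return cpu_rate / (cpu_rate + gpu_rate)

def split_work(total_items, cpu_ratio):
    """Partition work items so both devices should finish together."""
    cpu_items = int(total_items * cpu_ratio)
    return cpu_items, total_items - cpu_items
```

For example, if the GPU processes items four times faster per item, `profile_split_ratio(10000, 4e-6, 1e-6)` yields a CPU share of 0.2, and `split_work` then assigns 2,000 items to the CPU and 8,000 to the GPU.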
Pages: 2337-2347
Page count: 11
Related Papers
50 items in total
  • [1] CPU-Assisted GPGPU on Fused CPU-GPU Architectures
    Yang, Yi
    Xiang, Ping
    Mantor, Mike
    Zhou, Huiyang
    [J]. 2012 IEEE 18TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA), 2012, : 103 - 114
  • [2] WCET Analysis of the Shared Data Cache in Integrated CPU-GPU Architectures
    Huangfu, Yijie
    Zhang, Wei
    [J]. 2017 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2017,
  • [3] Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning
    Saba, Issa
    Arima, Eishi
    Liu, Dai
    Schulz, Martin
    [J]. ARCHITECTURE OF COMPUTING SYSTEMS, ARCS 2022, 2022, 13642 : 51 - 67
  • [4] Selective GPU Caches to Eliminate CPU-GPU HW Cache Coherence
    Agarwal, Neha
    Nellans, David
    Ebrahimi, Eiman
    Wenisch, Thomas F.
    Danskin, John
    Keckler, Stephen W.
    [J]. PROCEEDINGS OF THE 2016 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA-22), 2016, : 494 - 506
  • [5] SRAM- and STT-RAM-based hybrid, shared last-level cache for on-chip CPU-GPU heterogeneous architectures
    Gao, Lan
    Wang, Rui
    Xu, Yunlong
    Yang, Hailong
    Luan, Zhongzhi
    Qian, Depei
    Zhang, Han
    Cai, Jihong
    [J]. JOURNAL OF SUPERCOMPUTING, 2018, 74 (07): : 3388 - 3414
  • [6] CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU-GPU system
    Zhang, Qi
    Liu, Yi
    Liu, Tao
    Qian, Depei
    [J]. JOURNAL OF SUPERCOMPUTING, 2023, 79 (13): : 14172 - 14199
  • [7] Denial of Service in CPU-GPU Heterogeneous Architectures
    Wen, Hao
    Zhang, Wei
    [J]. 2020 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2020,
  • [8] Hardware Support for Concurrent Detection of Multiple Concurrency Bugs on Fused CPU-GPU Architectures
    Zhang, Weihua
    Yu, Shiqiang
    Wang, Haojun
    Dai, Zhuofang
    Chen, Haibo
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2016, 65 (10) : 3083 - 3095
  • [9] ONLINE SCHEDULING OF MIXED CPU-GPU JOBS
    Chen, Lin
    Ye, Deshi
    Zhang, Guochuang
    [J]. INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2014, 25 (06) : 745 - 761
  • [10] HybridHadoop: CPU-GPU Hybrid Scheduling in Hadoop
    Oh, Chanyoung
    Jung, Hyeonjin
    Yi, Saehanseul
    Yoon, Illo
    Yi, Youngmin
    [J]. PROCEEDINGS OF INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING IN ASIA-PACIFIC REGION (HPC ASIA 2021), 2020, : 40 - 49