Exploring data flow design and vectorization with oneAPI for streaming applications on CPU plus GPU

Authors
Campos, Cristian [1]
Asenjo, Rafael [1]
Navarro, Angeles [1]
Affiliations
[1] Univ Malaga, Dept Comp Architecture, Malaga 29071, Malaga, Spain
Source
JOURNAL OF SUPERCOMPUTING | 2025 / Volume 81 / Issue 02
Keywords
Streaming applications; Heterogeneous computing; Analytical model; Queue theory; CPU plus GPU; oneAPI; SYCL;
DOI
10.1007/s11227-024-06891-3
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
In recent times, oneAPI has emerged as a competitive framework to optimize streaming applications on heterogeneous CPU+GPU architectures, since it provides portability and performance thanks to the SYCL programming language and efficient parallel libraries such as oneTBB. However, this approach opens up a wealth of implementation alternatives for this type of application: from how to design the data flow to how to exploit data parallelism. Choosing the best alternative is not trivial, so in this paper we analyze them and contribute an analytical model based on queue theory that helps in the on-line selection of the alternative that maximizes the throughput and the occupancy of the CPU and GPU compute units. We explore the design space offered by: a) different APIs to define the data flow (parallel_pipeline and Flow Graph from oneTBB, and SYCL events from SYCL); b) alternative kernel implementations to express data parallelism (SYCL, AVX and std::simd); and c) the mapping of the kernels onto the available computing resources (CPU cores and GPU). The results show that the std::simd library can be 1.54x faster, 3% more energy efficient, and require 7.36x less programming effort than AVX, and that implementations that enable asynchronous offloading of tasks to the devices, such as those based on the SYCL events and Flow Graph APIs, outperform the other APIs, being up to 1.10x faster and up to 1.18x more energy efficient.
Pages: 30