A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors

Cited by: 0

Authors:

Son, Youngbin [1]

Kang, Seokwon [2]

Um, Hongjun [2]

Lee, Seokho [1]

Ham, Jonghyun [2]

Kim, Donghyeon [2]

Park, Yongjun [1,2]

Affiliations:

[1] Hanyang Univ, Dept Artificial Intelligence, Seoul 04763, South Korea

[2] Hanyang Univ, Dept Comp Sci, Seoul 04763, South Korea

Source:

ELECTRONICS | 2021 / Volume 10 / Issue 23

Funding:

National Research Foundation of Singapore;

Keywords:

vector processors; job offloading; resource utilization; data parallelism; heterogeneous system architectures;

DOI:

10.3390/electronics10232960

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Most modern processors contain a vector accelerator or internal vector units for the fast computation of large target workloads. However, accelerating applications using vector units is difficult because the underlying data parallelism must be exposed explicitly using vector-specific instructions. As a result, vector units are often underutilized or remain idle because of the difficulty of vector code generation. To address this underutilization of existing vector units, we propose the Vector Offloader, which executes scalar programs by treating the vector unit as a scalar operation unit. By using vector masking, an appropriate partition of the vector unit can be used to support scalar instructions. To efficiently utilize all execution units, including the vector unit, the Vector Offloader runs the target applications concurrently on both the central processing unit (CPU) and the decoupled vector units by offloading some parts of the program to the vector unit. Furthermore, a profile-guided optimization technique is employed to determine the optimal offloading ratio for balancing the load between the CPU and the vector unit. We implemented the Vector Offloader on a RISC-V infrastructure with a Hwacha vector unit and evaluated its performance using the PolyBench benchmark suite. Experimental results showed that the proposed technique achieved a performance improvement of up to 1.31x over simple CPU-only execution in a field-programmable gate array (FPGA)-level evaluation.
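
The load-balancing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' profile-guided algorithm. It assumes a simplified model in which every loop iteration costs a fixed, profiled time on each unit and the two units run fully in parallel with no transfer overhead. The function names and the timing values (`t_cpu`, `t_vec`) are hypothetical and chosen only for illustration.

```python
def balanced_offload_ratio(t_cpu: float, t_vec: float) -> float:
    """Fraction of iterations to offload so both units finish together.

    t_cpu, t_vec: profiled per-iteration times on the CPU and on the
    vector unit (hypothetical inputs). Solving
    (1 - r) * t_cpu == r * t_vec for r gives the ratio below.
    """
    if t_cpu <= 0 or t_vec <= 0:
        raise ValueError("per-iteration times must be positive")
    return t_cpu / (t_cpu + t_vec)


def concurrent_time(n: int, ratio: float, t_cpu: float, t_vec: float) -> float:
    """Completion time when round(n * ratio) iterations run on the vector unit."""
    n_vec = round(n * ratio)
    n_cpu = n - n_vec
    # Both units run in parallel, so the slower side sets the finish time.
    return max(n_cpu * t_cpu, n_vec * t_vec)


def speedup_over_cpu_only(n: int, ratio: float, t_cpu: float, t_vec: float) -> float:
    """Speedup relative to running all n iterations on the CPU alone."""
    return (n * t_cpu) / concurrent_time(n, ratio, t_cpu, t_vec)


# Hypothetical profile: the vector unit, used as a scalar unit, is three
# times slower per iteration than the CPU.
ratio = balanced_offload_ratio(t_cpu=1.0, t_vec=3.0)
speedup = speedup_over_cpu_only(n=1000, ratio=ratio, t_cpu=1.0, t_vec=3.0)
```

Under this model, even a unit that is much slower per iteration shortens the total run time once it takes a suitably small share of the work, which is why the offloading ratio has to be tuned rather than fixed.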


Pages: 15
