Revolver: Processor Architecture for Power Efficient Loop Execution

Cited by: 0

Authors:

Hayenga, Mitchell [1,2]

Naresh, Vignyan Reddy Kothinti [2]

Lipasti, Mikko H. [2]

Affiliations:

[1] ARM Inc, Cambridge, England

[2] Univ Wisconsin, Madison, WI 53706 USA

Source:

2014 20TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA-20) | 2014

Keywords:

CACHE;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

With the rise of mobile and cloud-based computing, modern processor design has become the task of achieving maximum power efficiency at specific performance targets. This trend, coupled with dwindling improvements in single-threaded performance, has led architects to focus predominantly on energy efficiency. In this paper we note that for the majority of benchmarks, a substantial portion of execution time is spent executing simple loops. Capitalizing on the frequency of loops, we design an out-of-order processor architecture that achieves an aggressive level of performance while minimizing the energy consumed during the execution of loops. The Revolver architecture achieves energy efficiency during loop execution by enabling "in-place execution" of loops within the processor's out-of-order backend. Essentially, a few static instances of each loop instruction are dispatched to the out-of-order execution core by the processor frontend. The static instruction instances may each be executed multiple times in order to complete all necessary loop iterations. During loop execution the processor frontend, including instruction fetch, branch prediction, decode, allocation, and dispatch logic, can be completely clock gated. Additionally, we propose a mechanism to pre-execute future loop iteration load instructions, thereby realizing parallelism beyond the loop iterations currently executing within the processor core. Employing Revolver across three benchmark suites, we eliminate 20%, 55%, and 84% of all frontend instruction dispatches. Overall, we find Revolver maintains performance, while resulting in 5.3%-18.3% energy-delay benefit over loop buffers or micro-op cache techniques alone.
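To make the dispatch-elimination idea concrete, the following is a minimal back-of-the-envelope sketch, not the paper's simulator or mechanism. It assumes a hypothetical workload described only by loop body sizes and trip counts, and it compares a conventional frontend (every dynamic instruction is dispatched) with an idealized in-place scheme in which only a few static copies of each loop body are dispatched. The names `Loop`, `conventional_dispatches`, `in_place_dispatches`, `dispatch_reduction`, and the `copies` parameter are illustrative assumptions; loop detection, training, and fallback behavior are ignored.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Loop:
    """Hypothetical loop description (an assumption for illustration)."""
    body_len: int    # static instructions in the loop body
    iterations: int  # dynamic trip count


def conventional_dispatches(loop: Loop) -> int:
    # Conventional core: every dynamic instruction goes through the frontend.
    return loop.body_len * loop.iterations


def in_place_dispatches(loop: Loop, copies: int = 1) -> int:
    # Idealized in-place execution: only `copies` static instances of the
    # body are dispatched; the backend re-executes them for later iterations.
    return loop.body_len * min(copies, loop.iterations)


def dispatch_reduction(loops: Iterable[Loop], non_loop_instrs: int,
                       copies: int = 1) -> float:
    # Fraction of frontend dispatches eliminated relative to the baseline.
    loops = list(loops)
    baseline = non_loop_instrs + sum(conventional_dispatches(l) for l in loops)
    in_place = non_loop_instrs + sum(in_place_dispatches(l, copies) for l in loops)
    return 1.0 - in_place / baseline


if __name__ == "__main__":
    # One 10-instruction loop run 100 times, plus 1000 non-loop instructions.
    workload = [Loop(body_len=10, iterations=100)]
    print(f"{dispatch_reduction(workload, non_loop_instrs=1000):.3f}")
```

In this toy model the achievable reduction is bounded by the share of dynamic instructions that fall inside loops, which is consistent with the abstract reporting different reductions for different benchmark suites.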

Pages: 591-602

Page count: 12
