Improving the Utilization of Micro-operation Caches in x86 Processors

Cited by: 6

Authors:

Kotra, Jagadish B. [1]

Kalamatianos, John [1]

Affiliations:

[1] AMD Res, Austin, TX 78735 USA

Source:

2020 53RD ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO 2020) | 2020

Keywords:

Micro-operations Cache; CPU front-end; CISC; X86; Micro-ops; COMPRESSION

DOI:

10.1109/MICRO50266.2020.00025

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Most modern processors employ variable-length, Complex Instruction Set Computing (CISC) instructions to reduce instruction fetch energy cost and bandwidth requirements. High-throughput decoding of CISC instructions requires energy-hungry logic for instruction identification. The need for efficient execution of CISC instructions motivated mapping them to fixed-length micro-operations (also known as uops). To reduce costly decoder activity, commercial CISC processors employ a micro-operations cache (uop cache) that caches uop sequences, bypassing the decoder. The uop cache's benefits are: (1) shorter pipeline length for uops dispatched by the uop cache, (2) lower decoder energy consumption, and (3) earlier detection of mispredicted branches. In this paper, we observe that a uop cache can be heavily fragmented under certain uop cache entry construction rules. Based on this observation, we propose two complementary optimizations to address fragmentation: Cache Line boundary AgnoStic uoP cache design (CLASP) and uop cache compaction. CLASP addresses the internal fragmentation caused by short, sequential uop sequences, terminated at the I-cache line boundary, by fusing them into a single uop cache entry. Compaction further lowers fragmentation by placing temporally correlated, non-sequential uop sequences that map to the same uop cache set in the same uop cache entry. Our experiments on an x86 simulator using a wide variety of benchmarks show that CLASP improves performance by up to 5.6% and lowers decoder power by up to 19.63%. When CLASP is coupled with the most aggressive compaction variant, performance improves by up to 12.8% and decoder power savings are up to 31.53%.
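The fragmentation argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's design or simulator; the entry capacity of 8 uops, the sequence lengths, and all function names are assumptions chosen only for illustration. It compares storing each sequential uop sequence in its own entry (as when every sequence is terminated at an I-cache line boundary) against greedily fusing consecutive sequences into one entry while they fit, which is the general idea the abstract attributes to CLASP.

```python
ENTRY_CAPACITY = 8  # assumed uops per uop cache entry (illustrative, not from the paper)


def baseline_entries(seq_lengths):
    """One entry per uop sequence: each sequence ends at an I-cache line boundary."""
    return [[n] for n in seq_lengths]


def fused_entries(seq_lengths, capacity=ENTRY_CAPACITY):
    """Greedily fuse consecutive (sequential) sequences into one entry while they fit.

    Assumes every individual sequence length is <= capacity.
    """
    entries = []
    for n in seq_lengths:
        if entries and sum(entries[-1]) + n <= capacity:
            entries[-1].append(n)   # fuse with the previous sequential sequence
        else:
            entries.append([n])     # start a new entry
    return entries


def wasted_slots(entries, capacity=ENTRY_CAPACITY):
    """Total unused uop slots across all entries (internal fragmentation)."""
    return sum(capacity - sum(e) for e in entries)


# Hypothetical lengths (in uops) of five sequential uop sequences.
lengths = [3, 2, 6, 1, 4]
base = baseline_entries(lengths)
fused = fused_entries(lengths)
print(len(base), wasted_slots(base))    # 5 entries, 24 unused slots
print(len(fused), wasted_slots(fused))  # 3 entries, 8 unused slots
```

Under these assumed numbers, fusing cuts the entry count from 5 to 3 and the unused slots from 24 to 8. The compaction scheme described in the abstract targets non-sequential sequences in the same set and is not modeled here.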


Pages: 160-172

Page count: 13
