Improving the Utilization of Micro-operation Caches in x86 Processors

Cited by: 6
Authors
Kotra, Jagadish B. [1 ]
Kalamatianos, John [1 ]
Affiliations
[1] AMD Res, Austin, TX 78735 USA
Keywords
Micro-operations Cache; CPU front-end; CISC; X86; Micro-ops; COMPRESSION;
DOI
10.1109/MICRO50266.2020.00025
CLC classification: TP3 [Computing technology, computer technology]
Discipline code: 0812
Abstract
Most modern processors employ variable-length, Complex Instruction Set Computing (CISC) instructions to reduce instruction fetch energy cost and bandwidth requirements. High-throughput decoding of CISC instructions requires energy-hungry logic for instruction identification. To execute CISC instructions efficiently, processors map them to fixed-length micro-operations (also known as uops). To reduce costly decoder activity, commercial CISC processors employ a micro-operations cache (uop cache) that caches uop sequences, bypassing the decoder. The uop cache's benefits are: (1) a shorter pipeline for uops dispatched from the uop cache, (2) lower decoder energy consumption, and (3) earlier detection of mispredicted branches. In this paper, we observe that a uop cache can be heavily fragmented under certain uop cache entry construction rules. Based on this observation, we propose two complementary optimizations to address fragmentation: a Cache Line boundary AgnoStic uoP cache design (CLASP) and uop cache compaction. CLASP addresses the internal fragmentation caused by short, sequential uop sequences, terminated at the I-cache line boundary, by fusing them into a single uop cache entry. Compaction further lowers fragmentation by placing temporally correlated, non-sequential uop sequences that map to the same uop cache set into the same uop cache entry. Our experiments on an x86 simulator using a wide variety of benchmarks show that CLASP improves performance by up to 5.6% and lowers decoder power by up to 19.63%. When CLASP is coupled with the most aggressive compaction variant, performance improves by up to 12.8% and decoder power savings reach up to 31.53%.
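The fragmentation problem the abstract describes can be illustrated with a toy model. The sketch below is a hypothetical simplification, not the paper's or AMD's implementation: it assumes a uop cache entry holds up to 8 uops and that the baseline terminates an entry at every I-cache line boundary, then contrasts that with CLASP-style fusion of consecutive sequential runs into one entry.

```python
# Toy model of uop cache entry construction (assumptions, not AMD's design):
# each entry holds up to ENTRY_CAPACITY uops, and the baseline policy ends
# an entry at every I-cache line boundary, so each short run wastes slots.
ENTRY_CAPACITY = 8  # assumed uops per uop cache entry

def baseline_entries(uop_runs):
    """Baseline: one entry per sequential run (a run ends at an I-cache
    line boundary), leaving unused slots in each entry."""
    return len(uop_runs)

def clasp_entries(uop_runs):
    """CLASP-style fusion: merge consecutive sequential runs into the
    current entry until its uop capacity would be exceeded."""
    entries, used = 0, 0
    for run in uop_runs:
        if used and used + run <= ENTRY_CAPACITY:
            used += run          # fuse the run into the current entry
        else:
            entries += 1         # start a new entry for this run
            used = run
    return entries

# Four short sequential runs (uop counts), each cut at a line boundary:
runs = [3, 2, 3, 4]
print(baseline_entries(runs))  # 4 entries, mostly half-empty
print(clasp_entries(runs))     # 2 entries after fusion (3+2+3, then 4)
```

With the same total of 12 uops, fusion halves the number of entries consumed, which is the capacity recovery CLASP exploits; compaction extends the same idea to non-sequential but temporally correlated runs in the same set.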
Pages: 160-172
Page count: 13
Related papers
45 records in total
  • [1] UC-Check: Characterizing Micro-operation Caches in x86 Processors and Implications in Security and Performance
    Kim, Joonsung
    Jang, Hamin
    Lee, Hunjun
    Lee, Seungho
    Kim, Jangwoo
    PROCEEDINGS OF 54TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, MICRO 2021, 2021, : 550 - 564
  • [2] X86 PROCESSORS
    Dipert, Brian
    EDN, 2010, 55 (03) : 19 - +
  • [3] Fast Concurrent Queues for x86 Processors
    Morrison, Adam
    Afek, Yehuda
    ACM SIGPLAN NOTICES, 2013, 48 (08) : 103 - 112
  • [4] A SUCCESSFUL DESIGN METHODOLOGY FOR X86 PROCESSORS
    [Anonymous]
    ELECTRONIC ENGINEERING, 1993, 65 (801): : S39 - S41
  • [5] Optimizing precision overhead for x86 processors
    Ogasawara, T
    Komatsu, H
    Nakatani, T
    SOFTWARE-PRACTICE & EXPERIENCE, 2004, 34 (09): : 875 - 893
  • [6] MPTLsim: A Simulator for X86 Multicore Processors
    Zeng, Hui
    Yourst, Matt
    Ghose, Kanad
    Ponomarev, Dmitry
    DAC: 2009 46TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, VOLS 1 AND 2, 2009, : 226 - 231
  • [7] Optimizing precision overhead for x86 processors
    Ogasawara, T
    Komatsu, H
    Nakatani, T
    USENIX ASSOCIATION PROCEEDINGS OF THE 2ND JAVA(TM) VIRTUAL MACHINE RESEARCH AND TECHNOLOGY SYMPOSIUM, 2002, : 41 - 50
  • [8] A Statistical Approach to Power Estimation for x86 Processors
    Chadha, Mohak
    Ilsche, Thomas
    Bielert, Mario
    Nagel, Wolfgang E.
    2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2017, : 1012 - 1019
  • [9] Investigating Cache Parameters of x86 Family Processors
    Babka, Vlastimil
    Tuma, Petr
    COMPUTER PERFORMANCE EVALUATION AND BENCHMARKING, PROCEEDINGS, 2009, 5419 : 77 - 96
  • [10] CVR: Efficient Vectorization of SpMV on X86 Processors
    Xie, Biwei
    Zhan, Jianfeng
    Liu, Xu
    Gao, Wanling
    Jia, Zhen
    He, Xiwen
    Zhang, Lixin
    PROCEEDINGS OF THE 2018 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO'18), 2018, : 149 - 162