Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors

Authors
Garcia, Victor [1 ,2 ]
Rico, Alejandro [2 ]
Villavieja, Carlos [3 ]
Carpenter, Paul [2 ]
Navarro, Nacho [1 ,2 ]
Ramirez, Alex [4 ]
Affiliations
[1] Univ Politecn Cataluna, Barcelona, Spain
[2] Barcelona Supercomp Ctr, Barcelona, Spain
[3] Google Inc, New York, NY USA
[4] NVIDIA Corp, Santa Clara, CA USA
Keywords
Cache memories; Prefetch; Task-based programming models
DOI
10.1007/s10766-016-0431-8
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Subject classification code
081202
Abstract
Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique for alleviating this problem. Prefetching can be performed by the hardware or initiated and controlled by software. Software-controlled prefetching covers a wide variety of schemes, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks shows a maximum speedup of 32% and an average speedup of 10% compared to a baseline with hardware prefetching only. As a result, we also reduce energy-to-solution by up to 18%, and by 3% on average.
Pages: 530-550
Number of pages: 21