DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations

Cited by: 0
Authors
Soria-Pardos, Victor [1]
Armejach, Adria [2]
Muck, Tiago [3]
Suarez Gracia, Dario [4]
Joao, Jose A. [3]
Rico, Alejandro [5]
Moreto, Miquel [2]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona Supercomp Ctr, Barcelona, Spain
[3] Arm, Austin, TX USA
[4] Univ Zaragoza, Zaragoza, Spain
[5] AMD, Austin, TX USA
Keywords
multi-core architectures; microarchitecture; atomic memory operations; data placement; BARRIER SYNCHRONIZATION; ARCHITECTURE; COMMUNICATION; SPLASH-2;
DOI
10.1145/3579371.3589065
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates currently available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs' execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09x across all workloads and 1.31x on AMO-intensive applications with respect to executing all AMOs near.
Pages: 420-432
Page count: 13