Porting, monitoring and tuning UPC on NUMA architectures

Authors
Mohamed, AS [1]
Affiliation
[1] George Washington Univ, Dept Elect & Comp Engn, Washington, DC 20052 USA
Keywords
parallel C; P-threads; optimization; memory consistency
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this work we report on our experience in porting the NAS NPB benchmarks using the recently developed GCC-SGI UPC compiler on the SGI Origin 3800 NUMA machine. The SGI NUMA environment has provided new opportunities for UPC. For example, by coupling Unix P-threads with standard UPC threads, one is able to code solutions to problems using pipelining, divide-and-conquer, and speculative parallelization styles. This task-level parallelism was not previously possible in UPC, which relies mainly on fine-grain data parallelism over distributed shared memory. It has led to multiple threads per processor and provided further opportunities for optimization through load balancing. The SGI CC-NUMA environment also provided memory consistency optimizations to mask the latency of remote accesses, convert aggregate accesses into more efficient bulk operations, and cache data locally. UPC allows programmers to specify memory accesses with "relaxed" consistency semantics. The CC-NUMA environment exploits these explicit consistency "hints" very effectively to hide latency and to reduce coherence overheads further by, for example, allowing two or more processors to modify their local copies of shared data concurrently and merging the modifications at synchronization points. This characteristic alleviates the effect of false sharing. Yet another opportunity, made possible by the spectrum of performance analysis and profiling tools within the SGI NUMA environment, is the development of a new monitoring and tuning strategy that aims at improving the efficiency of parallel UPC applications. We are able to project the physically monitored parameters back to the data structures and high-level program constructs within the UPC source code. This increases a programmer's ability to understand, develop, and optimize UPC programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers are able to detect communication, data/thread layout, and I/O bottlenecks and to further optimize UPC programs with better data and thread layouts, potentially resulting in significant performance improvements.
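The abstract does not say how modifications to local copies are merged at synchronization points. As an illustration only, the sketch below shows one generic way such a merge can work, the twin-and-diff scheme known from software distributed shared memory systems. It is a single-process simulation in plain C, not the SGI CC-NUMA mechanism and not UPC code. All names (`LINE_WORDS`, `local_copy_t`, `copy_in`, `merge_at_sync`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_WORDS 8  /* hypothetical: number of words in one shared block */

/* One writer's private view of a shared block. */
typedef struct {
    int twin[LINE_WORDS];   /* snapshot taken when the copy was made */
    int local[LINE_WORDS];  /* working copy that the writer modifies */
} local_copy_t;

/* Give a writer a private copy of the shared block and remember its twin. */
void copy_in(local_copy_t *c, const int *shared)
{
    assert(c != NULL && shared != NULL);
    memcpy(c->twin, shared, sizeof c->twin);
    memcpy(c->local, shared, sizeof c->local);
}

/*
 * At a synchronization point, write back only the words this writer
 * changed (local differs from twin). Words the writer never touched are
 * left alone, so two writers that updated different words of the same
 * block do not overwrite each other. Returns the number of words merged.
 */
size_t merge_at_sync(const local_copy_t *c, int *shared)
{
    assert(c != NULL && shared != NULL);
    size_t merged = 0;
    for (size_t i = 0; i < LINE_WORDS; i++) {
        if (c->local[i] != c->twin[i]) {
            shared[i] = c->local[i];
            merged++;
        }
    }
    return merged;
}
```

For example, if two writers both call `copy_in` on the same block, one sets `local[0]` and the other sets `local[5]`, then calling `merge_at_sync` for each leaves both updates in the shared block. This is the sense in which merging at synchronization points can reduce the cost of false sharing: the writers share a block but not the words inside it.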
Pages: 1518-1525
Number of pages: 8