Porting, monitoring and tuning UPC on NUMA architectures

Authors
Mohamed, AS [1]
Affiliation
[1] George Washington Univ, Dept Elect & Comp Engn, Washington, DC 20052 USA
Keywords
parallel C; P-threads; optimization; memory consistency;
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
In this work we report on our experience in porting the NAS NPB benchmarks using the recently developed GCC-SGI UPC compiler on the Origin 3800 NUMA machine. The SGI NUMA environment has provided new opportunities for UPC. For example, by coupling Unix P-threads with standard UPC threads, one is able to code solutions to problems using pipelining, divide-and-conquer, and speculative parallelization styles. This task-level parallelism was never before possible in UPC, which relies mainly on fine-grain data parallelism over distributed shared memory. This has led to having multiple threads per processor and provided further opportunities for optimization through load balancing. The SGI CC-NUMA environment also provided memory consistency optimizations to mask the latency of remote accesses, convert aggregate accesses into more efficient bulk operations, and cache data locally. UPC allows programmers to specify memory accesses with "relaxed" consistency semantics. These explicit consistency "hints" are exploited very effectively by the CC-NUMA environment to hide latency and further reduce coherence overheads by, for example, allowing two or more processors to modify their local copies of shared data concurrently and merging the modifications at synchronization points. This characteristic alleviates the effect of false sharing. Yet another opportunity, made possible by the spectrum of performance analysis and profiling tools within the SGI NUMA environment, is the development of a new monitoring and tuning strategy that aims at improving the efficiency of parallel UPC applications. We are able to project the physically monitored parameters back to the data structures and high-level program constructs within the UPC source code. This increases a programmer's ability to effectively understand, develop, and optimize UPC programs, enabling an exact analysis of a program's data and code layouts. Using this visualized information, programmers are able to detect communication, data/thread layout, and I/O bottlenecks and to further optimize UPC programs with better data and thread layouts, potentially resulting in significant performance improvements.
Pages: 1518-1525
Number of pages: 8