Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors

Cited by: 8
Authors
Park, Jongsoo [1]
Smelyanskiy, Mikhail [1]
Vaidyanathan, Karthikeyan [2]
Heinecke, Alexander [1]
Kalamkar, Dhiraj D. [2]
Patwary, Md. Mostofa Ali [1]
Pirogov, Vadim [3]
Dubey, Pradeep [1]
Liu, Xing [4]
Rosales, Carlos [5]
Mazauric, Cyril [6]
Daley, Christopher [7]
Affiliations
[1] Intel Corp, Parallel Comp Lab, 2200 Mission Coll Blvd, Santa Clara, CA 95051 USA
[2] Intel Corp, Parallel Comp Lab, Bangalore, Karnataka, India
[3] Intel Corp, Software & Serv Grp, Moscow, Russia
[4] IBM Res, TJ Watson Res Ctr, Yorktown Heights, NY USA
[5] Univ Texas Austin, Texas Adv Comp Ctr, Austin, TX 78712 USA
[6] Bull, Applicat & Performance Team, France
[7] Lawrence Berkeley Natl Lab, Natl Energy Res Sci Comp Ctr, Berkeley, CA USA
Keywords
High-performance conjugate gradient; HPCG; conjugate gradient; Xeon Phi; Gauss-Seidel; multi-grid; loop fusion; directed acyclic graph; task scheduling; ICCG; multiprocessor
DOI
10.1177/1094342015593157
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
This paper presents optimizations in a high-performance conjugate gradient benchmark (HPCG) for multi-core Intel® Xeon® processors and many-core Xeon Phi coprocessors. Without careful optimization, the HPCG benchmark under-utilizes the compute resources of modern processors because of its low arithmetic intensity and the difficulty of parallelizing its Gauss-Seidel smoother (GS). Our optimized implementation fuses GS with sparse matrix-vector multiplication (SpMV) to raise the arithmetic intensity, overcoming the memory-bandwidth bound that otherwise limits performance. This fusion optimization is progressively more effective on newer generations of Xeon processors, demonstrating the usefulness of their larger caches for sparse matrix operations: Sandy Bridge, Ivy Bridge, and Haswell processors achieve 93%, 99%, and 103%, respectively, of the ideal performance under the constraint that matrices are streamed from memory. Our implementation also parallelizes GS using fine-grain level scheduling, a method previously believed not to scale to many cores. Our GS implementation scales to 60 cores on Xeon Phi coprocessors for the finest level of the multi-grid pre-conditioner. At the coarser levels, we address the limited parallelism using block multi-color reordering, achieving 21 GFLOPS on one Xeon Phi coprocessor. These optimizations distinguish our HPCG implementation from others that stream most of the data from main memory and rely on multi-color reordering for parallelism. We have evaluated our optimized implementation on clusters with various configurations and find that low-diameter, high-radix network topologies such as Dragonfly achieve high parallelization efficiency because of their fast all-reduce collectives. In addition, we demonstrate that our optimizations benefit not only the HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
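To make the parallelization idea in the abstract concrete, the following is a minimal, hypothetical C/OpenMP sketch of a level-scheduled forward Gauss-Seidel sweep on a CSR matrix, in the spirit of the fine-grain level scheduling described above. It assumes the level sets have already been extracted from the row-dependency DAG; the names (CsrMatrix, rows_by_level, level_ptr) are illustrative and are not the authors' API.

    /* Hypothetical sketch: level-scheduled forward Gauss-Seidel on a CSR matrix.
     * Compile with an OpenMP-enabled compiler (e.g. gcc -fopenmp); without
     * OpenMP the pragma is ignored and the sweep runs sequentially. */
    typedef struct {
        int n;              /* number of rows */
        const int *rowptr;  /* CSR row pointers, length n+1 */
        const int *colidx;  /* CSR column indices */
        const double *val;  /* CSR nonzero values */
        const double *diag; /* pre-extracted diagonal entries, length n */
    } CsrMatrix;

    /* rows_by_level: row indices permuted so that rows within a level do not
     * depend on each other; level_ptr[l] .. level_ptr[l+1] delimits level l. */
    void gs_forward_levels(const CsrMatrix *A, const double *b, double *x,
                           const int *rows_by_level, const int *level_ptr,
                           int nlevels)
    {
        for (int l = 0; l < nlevels; ++l) {
            /* Rows in the same level are mutually independent, so the
             * sweep over one level can run fully in parallel. */
            #pragma omp parallel for schedule(static)
            for (int k = level_ptr[l]; k < level_ptr[l + 1]; ++k) {
                int i = rows_by_level[k];
                double s = b[i];
                for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; ++j) {
                    int c = A->colidx[j];
                    if (c != i)
                        s -= A->val[j] * x[c]; /* in-place: already-swept rows
                                                  contribute updated values */
                }
                x[i] = s / A->diag[i];
            }
        }
    }

Because x is updated in place and every row's predecessors in the dependency DAG sit in earlier levels, each row reads updated values for already-swept neighbors and old values for the rest, exactly as a sequential forward sweep would; the parallelism available per level is what limits scaling at the coarser multi-grid levels, which is where the block multi-color reordering mentioned in the abstract takes over.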
Pages: 11-27
Page count: 17