Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors

Cited by: 8
Authors
Park, Jongsoo [1 ]
Smelyanskiy, Mikhail [1 ]
Vaidyanathan, Karthikeyan [2 ]
Heinecke, Alexander [1 ]
Kalamkar, Dhiraj D. [2 ]
Patwary, Md Mosotofa Ali [1 ]
Pirogov, Vadim [3 ]
Dubey, Pradeep [1 ]
Liu, Xing [4 ]
Rosales, Carlos [5 ]
Mazauric, Cyril [6 ]
Daley, Christopher [7 ]
Affiliations
[1] Intel Corp, Parallel Comp Lab, 2200 Mission Coll Blvd, Santa Clara, CA 95051 USA
[2] Intel Corp, Parallel Comp Lab, Bangalore, Karnataka, India
[3] Intel Corp, Software & Serv Grp, Moscow, Russia
[4] IBM Res, TJ Watson Res Ctr, Yorktown Heights, NY USA
[5] Univ Texas Austin, Texas Adv Comp Ctr, Austin, TX 78712 USA
[6] Bull, Applications & Performance Team, France
[7] Lawrence Berkeley Natl Lab, Natl Energy Res Sci Comp Ctr, Berkeley, CA USA
Keywords
High-performance conjugate gradient; HPCG; conjugate gradient; Xeon Phi; Gauss-Seidel; multi-grid; loop fusion; directed acyclic graph; task scheduling; ICCG; MULTIPROCESSOR;
DOI
10.1177/1094342015593157
CLC Number
TP3 [computing technology, computer technology];
Discipline Code
0812 ;
Abstract
This paper presents optimizations in a high-performance conjugate gradient benchmark (HPCG) for multi-core Intel® Xeon® processors and many-core Xeon Phi coprocessors. Without careful optimization, the HPCG benchmark under-utilizes the compute resources available in modern processors because of its low arithmetic intensity and the challenges of parallelizing the Gauss-Seidel smoother (GS). Our optimized implementation fuses GS with sparse matrix-vector multiplication (SpMV) to address the low arithmetic intensity, overcoming the memory-bandwidth bound that otherwise limits performance. This fusion optimization is progressively more effective on newer generations of Xeon processors, demonstrating the usefulness of their larger caches for sparse matrix operations: Sandy Bridge, Ivy Bridge, and Haswell processors achieve 93%, 99%, and 103%, respectively, of the ideal performance under the constraint that matrices are streamed from memory. Our implementation also parallelizes GS using fine-grain level scheduling, a method previously believed not to scale to many cores. Our GS implementation scales to 60 cores on Xeon Phi coprocessors for the finest level of the multi-grid pre-conditioner. At the coarser levels, we address the limited parallelism using block multi-color re-ordering, achieving 21 GFLOPS with one Xeon Phi coprocessor. These optimizations distinguish our HPCG implementation from others that stream most of the data from main memory and rely on multi-color re-ordering for parallelism. Our optimized implementation has been evaluated on clusters with various configurations, and we find that low-diameter, high-radix network topologies such as Dragonfly realize high parallelization efficiencies because of fast all-reduce collectives. In addition, we demonstrate that our optimizations benefit not only the HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
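The fine-grain level scheduling mentioned in the abstract can be illustrated with a minimal sketch: row i of a lower-triangular sweep depends on every earlier row j < i with a nonzero in column j, and rows assigned the same "level" of this dependency DAG are mutually independent, so they may be updated in parallel. The list-of-dicts matrix layout and function names below are illustrative assumptions, not the paper's actual data structures.

```python
def compute_levels(rows):
    """rows[i] is a dict {j: value} of nonzeros in row i (lower triangle, j <= i).
    Returns a list of levels; all rows within one level are independent."""
    level = [0] * len(rows)
    for i, nz in enumerate(rows):
        deps = [level[j] for j in nz if j < i]   # rows this row must wait for
        level[i] = 1 + max(deps) if deps else 0
    nlev = max(level) + 1 if level else 0
    levels = [[] for _ in range(nlev)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels

def gauss_seidel_forward(rows, b):
    """One forward sweep x = L^{-1} b, traversed level by level."""
    x = [0.0] * len(b)
    for lev in compute_levels(rows):
        for i in lev:  # rows within a level could be updated concurrently
            s = sum(v * x[j] for j, v in rows[i].items() if j != i)
            x[i] = (b[i] - s) / rows[i][i]
    return x
```

For example, with rows `[{0: 2.0}, {0: 1.0, 1: 2.0}, {2: 4.0}]`, rows 0 and 2 form level 0 and row 1 forms level 1, so two of the three rows can be processed in the first parallel step.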
Pages: 11-27
Number of pages: 17
Related Papers
50 records in total
  • [21] Scalable High-Performance Parallel Design for Network Intrusion Detection Systems on Many-Core Processors
    Jiang, Haiyang
    Zhang, Guangxing
    Xie, Gaogang
    Salamatian, Kave
    Mathy, Laurent
    2013 ACM/IEEE SYMPOSIUM ON ARCHITECTURES FOR NETWORKING AND COMMUNICATIONS SYSTEMS (ANCS), 2013, : 137 - 146
  • [22] High performance in silico virtual drug screening on many-core processors
    McIntosh-Smith, Simon
    Price, James
    Sessions, Richard B.
    Ibarra, Amaurys A.
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2015, 29 (02): : 119 - 134
  • [23] Performance analysis of the high-performance conjugate gradient benchmark on GPUs
    Phillips, Everett
    Fatica, Massimiliano
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2016, 30 (01): : 28 - 38
  • [24] Low Overhead Message Passing for High Performance Many-Core Processors
    Kumar, Sumeet S.
    Djie, Mitzi Tjin A.
    van Leuken, Rene
    2013 FIRST INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING (CANDAR), 2013, : 345 - 351
  • [25] Parallel HEVC Decoding on Multi- and Many-core Architectures: A Power and Performance Analysis
    Chi, Chi Ching
    Alvarez-Mesa, Mauricio
    Lucas, Jan
    Juurlink, Ben
    Schierl, Thomas
    JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2013, 71 (03): : 247 - 260
  • [26] Performance and Scalability Study of FMM Kernels on Novel Multi- and Many-core Architectures
    Rey, Anton
    Igual, Francisco D.
    Prieto-Matias, Manuel
    Prins, Jan F.
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS 2017), 2017, 108 : 2313 - 2317
  • [27] A Nanophotonic Interconnect for High-Performance Many-Core Computation
    Beausoleil, R. G.
    Fiorentino, M.
    Ahn, J.
    Binkert, N.
    Davis, A.
    Fattal, D.
    Jouppi, N. P.
    McLaren, M.
    Santori, C. M.
    Schreiber, R. S.
    Spillane, S. M.
    Vantrease, D.
    Xu, Q.
    2008 5TH IEEE INTERNATIONAL CONFERENCE ON GROUP IV PHOTONICS, 2008, : 365 - 367
  • [28] A nanophotonic interconnect for high-performance many-core computation
    Beausoleil, R. G.
    Ahn, J.
    Binkert, N.
    Davis, A.
    Fattal, D.
    Fiorentino, M.
    Jouppi, N. P.
    McLaren, M.
    Santori, C. M.
    Schreiber, R. S.
    Spillane, S. M.
    Vantrease, D.
    Xu, Q.
    16TH ANNUAL IEEE SYMPOSIUM ON HIGH-PERFORMANCE INTERCONNECTS, PROCEEDINGS, 2008, : 182 - 189
  • [29] Parallel HEVC Decoding on Multi- and Many-core Architectures: A Power and Performance Analysis
    Chi Ching Chi
    Mauricio Alvarez-Mesa
    Jan Lucas
    Ben Juurlink
    Thomas Schierl
    Journal of Signal Processing Systems, 2013, 71 : 247 - 260
  • [30] Special Issue: Exploring the Frontiers of Computing Science and Technology: Adapting Emerging Multi- and Many-core Processors
    Zhou, Shujia
    Yesha, Yelena
    Halem, Milton
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2009, 21 (17): : 2141 - 2142