Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors

Cited by: 8
Authors
Park, Jongsoo [1 ]
Smelyanskiy, Mikhail [1 ]
Vaidyanathan, Karthikeyan [2 ]
Heinecke, Alexander [1 ]
Kalamkar, Dhiraj D. [2 ]
Patwary, Md. Mostofa Ali [1 ]
Pirogov, Vadim [3 ]
Dubey, Pradeep [1 ]
Liu, Xing [4 ]
Rosales, Carlos [5 ]
Mazauric, Cyril [6 ]
Daley, Christopher [7 ]
Affiliations
[1] Intel Corp, Parallel Comp Lab, 2200 Mission Coll Blvd, Santa Clara, CA 95051 USA
[2] Intel Corp, Parallel Comp Lab, Bangalore, Karnataka, India
[3] Intel Corp, Software & Serv Grp, Moscow, Russia
[4] IBM Res, TJ Watson Res Ctr, Yorktown Heights, NY USA
[5] Univ Texas Austin, Texas Adv Comp Ctr, Austin, TX 78712 USA
[6] Bull, Applications & Performance Team, France
[7] Lawrence Berkeley Natl Lab, Natl Energy Res Sci Comp Ctr, Berkeley, CA USA
Keywords
High-performance conjugate gradient; HPCG; conjugate gradient; Xeon Phi; Gauss-Seidel; multi-grid; loop fusion; directed acyclic graph; task scheduling; ICCG; multiprocessor
DOI
10.1177/1094342015593157
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
This paper presents optimizations in a high-performance conjugate gradient benchmark (HPCG) for multi-core Intel® Xeon® processors and many-core Xeon Phi coprocessors. Without careful optimization, the HPCG benchmark under-utilizes the compute resources of modern processors due to its low arithmetic intensity and the difficulty of parallelizing its Gauss-Seidel smoother (GS). Our optimized implementation fuses GS with sparse matrix-vector multiplication (SpMV) to raise the arithmetic intensity, overcoming the memory-bandwidth bound that otherwise limits performance. This fusion optimization becomes progressively more effective in newer generations of Xeon processors, demonstrating the usefulness of their larger caches for sparse matrix operations: Sandy Bridge, Ivy Bridge, and Haswell processors achieve 93%, 99%, and 103%, respectively, of the ideal performance under the constraint that matrices are streamed from memory. Our implementation also parallelizes GS using fine-grain level scheduling, a method previously believed not to scale to many cores. Our GS implementation scales to 60 cores on Xeon Phi coprocessors for the finest level of the multi-grid pre-conditioner. At the coarser levels, we address the limited parallelism using block multi-color re-ordering, achieving 21 GFLOPS with one Xeon Phi coprocessor. These optimizations distinguish our HPCG implementation from others that stream most of the data from main memory and rely on multi-color re-ordering for parallelism. We evaluate our optimized implementation on clusters with various configurations and find that low-diameter, high-radix network topologies such as Dragonfly realize high parallelization efficiency because of their fast all-reduce collectives. In addition, we demonstrate that our optimizations benefit not only the HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
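To make the fine-grain level-scheduling idea concrete, the following is a minimal C/OpenMP sketch, not the authors' implementation: it assigns each row of a CSR matrix a level based on its lower-triangular dependencies, orders the rows level by level, and runs a forward Gauss-Seidel sweep in which all rows of a level update in parallel. All identifiers (csr_t, build_levels, build_schedule, gs_forward) are hypothetical, and the sketch assumes a structurally symmetric matrix such as HPCG's, so rows sharing a level neither read nor write one another's entries of x.

    /* Minimal illustrative sketch (hypothetical names, not the paper's code):
     * fine-grain level scheduling of a forward Gauss-Seidel sweep on a CSR
     * matrix.  Error handling is omitted for brevity. */
    #include <stdlib.h>

    typedef struct {
        int n;                /* number of rows                    */
        const int *row_ptr;   /* CSR row pointers, length n + 1    */
        const int *col_idx;   /* CSR column indices                */
        const double *vals;   /* CSR nonzero values                */
        const double *diag;   /* diagonal entries a(i,i), length n */
    } csr_t;

    /* level(i) = 1 + max level over dependencies j < i with a(i,j) != 0;
     * rows with no mutual dependencies share a level.  Returns the number
     * of levels. */
    static int build_levels(const csr_t *A, int *level)
    {
        int nlevels = 0;
        for (int i = 0; i < A->n; ++i) {
            int lv = 0;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k) {
                int j = A->col_idx[k];
                if (j < i && level[j] + 1 > lv) lv = level[j] + 1;
            }
            level[i] = lv;
            if (lv + 1 > nlevels) nlevels = lv + 1;
        }
        return nlevels;
    }

    /* Counting sort of rows by level: perm lists rows level by level and
     * level_ptr[l] .. level_ptr[l+1] delimits level l within perm. */
    static void build_schedule(int n, const int *level, int nlevels,
                               int *level_ptr, int *perm)
    {
        int *cursor = calloc((size_t)nlevels + 1, sizeof *cursor);
        for (int i = 0; i < n; ++i) cursor[level[i] + 1]++;
        for (int l = 0; l < nlevels; ++l) cursor[l + 1] += cursor[l];
        for (int l = 0; l <= nlevels; ++l) level_ptr[l] = cursor[l];
        for (int i = 0; i < n; ++i) perm[cursor[level[i]]++] = i;
        free(cursor);
    }

    /* One forward GS sweep: levels execute in order, but the rows inside
     * a level are independent and update concurrently. */
    static void gs_forward(const csr_t *A, const int *perm,
                           const int *level_ptr, int nlevels,
                           const double *b, double *x)
    {
        for (int l = 0; l < nlevels; ++l) {
            #pragma omp parallel for schedule(static)
            for (int p = level_ptr[l]; p < level_ptr[l + 1]; ++p) {
                int i = perm[p];
                double s = b[i];
                for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k) {
                    int j = A->col_idx[k];
                    if (j != i) s -= A->vals[k] * x[j];
                }
                x[i] = s / A->diag[i]; /* in-place update, GS semantics */
            }
        }
    }

At the coarser multi-grid levels the number of rows per level shrinks, which is why the paper switches to block multi-color re-ordering there; the paper's actual kernel also fuses the sweep with SpMV, which this sketch omits.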
Pages: 11-27
Page count: 17
Related Papers
50 papers in total
  • [41] Architecture-based design and optimization of genetic algorithms on multi- and many-core systems
    Zheng, Long
    Lu, Yanchao
    Guo, Minyi
    Guo, Song
    Xu, Cheng-Zhong
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2014, 38 : 75 - 91
  • [42] High-performance multi/many-core architectures with shared and private queues: Network processing approaches
    Falamarzi, Reza
    Bahrambeigy, Bahram
    Ahmadi, Mahmood
    Rajabzadeh, Amir
    JOURNAL OF HIGH SPEED NETWORKS, 2018, 24 (02) : 89 - 106
  • [43] Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors
    Hou, Kaixi
    Feng, Wu-chun
    Che, Shuai
    2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2017, : 713 - 722
  • [44] Performance analysis of the Kahan-enhanced scalar product on current multi-core and many-core processors
    Hofmann, Johannes
    Fey, Dietmar
    Riedmann, Michael
    Eitzinger, Jan
    Hager, Georg
    Wellein, Gerhard
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (09)
  • [45] High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems
    Dongarra, Jack
    Heroux, Michael A.
    Luszczek, Piotr
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2016, 30 (01) : 3 - 10
  • [46] Parallel optimization using/for multi and many-core high performance computing
    Melab, Nouredine
    Zomaya, Albert Y.
    Chakroun, Imen
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2018, 112 : 109 - 110
  • [47] Performance and efficiency investigations of SIMD programs of Coulomb solvers on multi- and many-core systems with vector units
    Kramer, Ronny
    Ruenger, Gudula
    2020 28TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2020), 2020, : 237 - 244
  • [48] Performance Extraction and Suitability Analysis of Multi- and Many-core Architectures for Next Generation Sequencing Secondary Analysis
    Misra, Sanchit
    Pan, Tony C.
    Mahadik, Kanak
    Powley, George
    Vaidya, Priya N.
    Vasimuddin, Md
    Aluru, Srinivas
    27TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT 2018), 2018
  • [49] Performance comparison of designated preprocessing white light interferometry algorithms on emerging multi- and many-core architectures
    Schneider, Max
    Fey, Dietmar
    Kapusi, Daniel
    Machleidt, Torsten
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 2037 - 2046
  • [50] Performance Assessment of Hybrid Parallelism for Large-Scale Reservoir Simulation on Multi- and Many-core Architectures
    AlOnazi, Amani
    Rogowski, Marcin
    Al-Zawawi, Ahmed
    Keyes, David
    2018 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2018