Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors

Cited by: 8
Authors
Park, Jongsoo [1 ]
Smelyanskiy, Mikhail [1 ]
Vaidyanathan, Karthikeyan [2 ]
Heinecke, Alexander [1 ]
Kalamkar, Dhiraj D. [2 ]
Patwary, Md Mosotofa Ali [1 ]
Pirogov, Vadim [3 ]
Dubey, Pradeep [1 ]
Liu, Xing [4 ]
Rosales, Carlos [5 ]
Mazauric, Cyril [6 ]
Daley, Christopher [7 ]
Affiliations
[1] Intel Corp, Parallel Comp Lab, 2200 Mission Coll Blvd, Santa Clara, CA 95051 USA
[2] Intel Corp, Parallel Comp Lab, Bangalore, Karnataka, India
[3] Intel Corp, Software & Serv Grp, Moscow, Russia
[4] IBM Res, TJ Watson Res Ctr, Yorktown Heights, NY USA
[5] Univ Texas Austin, Texas Adv Comp Ctr, Austin, TX 78712 USA
[6] Applicat & Performance Team, Bull, France
[7] Lawrence Berkeley Natl Lab, Natl Energy Res Sci Comp Ctr, Berkeley, CA USA
Keywords
High-performance conjugate gradient; HPCG; conjugate gradient; Xeon Phi; Gauss-Seidel; multi-grid; loop fusion; directed acyclic graph; task scheduling; ICCG; MULTIPROCESSOR;
DOI
10.1177/1094342015593157
CLC number
TP3 [computing technology, computer technology];
Discipline code
0812;
Abstract
This paper presents optimizations of the high-performance conjugate gradient (HPCG) benchmark for multi-core Intel® Xeon® processors and many-core Xeon Phi coprocessors. Without careful optimization, the HPCG benchmark under-utilizes the compute resources of modern processors because of its low arithmetic intensity and the difficulty of parallelizing the Gauss-Seidel smoother (GS). Our optimized implementation fuses GS with sparse matrix-vector multiplication (SpMV) to address the low arithmetic intensity, overcoming the memory-bandwidth bound that otherwise limits performance. This fusion optimization becomes progressively more effective on newer-generation Xeon processors, demonstrating the usefulness of their larger caches for sparse matrix operations: Sandy Bridge, Ivy Bridge, and Haswell processors achieve 93%, 99%, and 103%, respectively, of the ideal performance under the constraint that matrices are streamed from memory. Our implementation also parallelizes GS using fine-grain level scheduling, a method previously believed not to scale to many cores. Our GS implementation scales to 60 cores on Xeon Phi coprocessors for the finest level of the multi-grid pre-conditioner. At the coarser levels, we address the limited parallelism using block multi-color re-ordering, achieving 21 GFLOPS on one Xeon Phi coprocessor. These optimizations distinguish our HPCG implementation from others, which stream most of the data from main memory and rely on multi-color re-ordering for parallelism. We evaluated our optimized implementation on clusters with various configurations and find that low-diameter, high-radix network topologies such as Dragonfly achieve high parallelization efficiencies because of fast all-reduce collectives. In addition, we demonstrate that our optimizations benefit not only the HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
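The fine-grain level scheduling mentioned in the abstract can be illustrated with a minimal sketch (an assumption for illustration, not the authors' implementation): rows of the triangular Gauss-Seidel sweep are grouped into dependency levels, where every row in a level depends only on rows in earlier levels, so all rows within one level can be updated in parallel. The `build_levels` helper and the dependency-list input format are hypothetical.

```python
# Hedged sketch of level scheduling for a forward Gauss-Seidel sweep.
# Input: for each row i, the list of rows j < i it depends on (i.e. the
# column indices of the strictly lower-triangular nonzeros in row i).
# Output: rows grouped into levels; rows in the same level are independent
# and could be updated concurrently (e.g. one OpenMP barrier per level).

def build_levels(n, deps):
    """Group rows 0..n-1 into dependency levels.

    deps[i] -- rows j < i that row i reads (lower-triangular nonzeros).
    Returns a list of levels, each a list of row indices.
    """
    level = [0] * n
    for i in range(n):
        # A row sits one level after the deepest row it depends on.
        for j in deps[i]:
            level[i] = max(level[i], level[j] + 1)
    nlev = max(level) + 1 if n else 0
    buckets = [[] for _ in range(nlev)]
    for i in range(n):
        buckets[level[i]].append(i)
    return buckets

# Chain 0 -> 1 -> 2 plus an independent row 3:
# rows 0 and 3 form level 0 and can be processed in parallel.
print(build_levels(4, [[], [0], [1], []]))  # [[0, 3], [1], [2]]
```

For the structured HPCG grid the number of levels is small relative to the row count at the finest multi-grid level, which is what makes this scheme scale; at coarser levels the levels become too short, which is why the abstract switches to block multi-color re-ordering there.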
Pages: 11-27 (17 pages)