AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs

Cited by: 82

Authors:

Wang, Qian ^{[1]}

Zhang, Xianyi ^{[1,2]}

Zhang, Yunquan ^{[3]}

Yi, Qing ^{[4]}

Affiliations:

[1] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Software, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Software, State Key Lab Comp Architecture, Beijing, Peoples R China

[4] Univ Colorado, Boulder, CO 80309 USA

Source:

2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC) | 2013

Keywords:

DLA code optimization; code generation; auto-tuning;

DOI:

10.1145/2503210.2503219

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual interference from developers. In particular, based on domain-specific knowledge about algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of Intel MKL and AMD ACML BLAS libraries, on both Intel Sandy Bridge and AMD Piledriver processors.
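To make the phrase "commonly occurring instruction sequences within the optimized low-level C code" concrete, the sketch below shows what unrolled, scalar-replaced C versions of two of the named kernels (AXPY and DOT) might look like. It is an illustration only, not AUGEM's templates or its generated code: the function names `axpy_unrolled` and `dot_unrolled`, the unroll factor of 4, and the choice of `double` are all assumptions made for this example. The repeated load-multiply-add statements in each loop body are the kind of regular pattern that a template matcher could plausibly recognize and map to SIMD instructions.

```c
#include <assert.h>
#include <stddef.h>

/* AXPY: y[i] += alpha * x[i].
 * Hypothetical low-level C form: unrolled by 4 with scalar temporaries,
 * so the loop body is a repeated load / multiply / add / store pattern. */
void axpy_unrolled(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double t0 = alpha * x[i + 0];
        double t1 = alpha * x[i + 1];
        double t2 = alpha * x[i + 2];
        double t3 = alpha * x[i + 3];
        y[i + 0] += t0;
        y[i + 1] += t1;
        y[i + 2] += t2;
        y[i + 3] += t3;
    }
    /* Remainder loop for n not divisible by 4. */
    for (; i < n; ++i)
        y[i] += alpha * x[i];
}

/* DOT: returns sum of x[i] * y[i].
 * Four independent accumulators expose instruction-level parallelism
 * and correspond naturally to the lanes of a SIMD register. */
double dot_unrolled(size_t n, const double *x, const double *y)
{
    assert(x != NULL && y != NULL);
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i + 0] * y[i + 0];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }
    double sum = (acc0 + acc1) + (acc2 + acc3);
    /* Remainder loop for n not divisible by 4. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Note that splitting the DOT reduction across several accumulators changes the order of floating-point additions, so results can differ in the last bits from a strictly sequential sum.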


Pages: 12

20 entries in total

[1] Automatic Generation of High-Performance FFT Kernels on Arm and X86 CPUs
Li, Zhihao
Jia, Haipeng
Zhang, Yunquan
Chen, Tun
Yuan, Liang
Vuduc, Richard
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (08) : 1925 - 1941
[2] A high-performance implementation of atomistic spin dynamics simulations on x86 CPUs
Chen, Hongwei
Zhai, Yujia
Turner, Joshua J.
Feiguin, Adrian
COMPUTER PHYSICS COMMUNICATIONS, 2023, 291
[3] FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs
Wu, Shixun
Zhai, Yujia
Huang, Jiajun
Jian, Zizhe
Chen, Zizhong
PROCEEDINGS OF THE 32ND INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, HPDC 2023, 2023, : 323 - 324
[4] FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs
Zhai, Yujia
Giem, Elisabeth
Zhao, Kai
Liu, Jinyang
Huang, Jiajun
Wong, Bryan M.
Shelton, Christian R.
Chen, Zizhong
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, 34 (12) : 3207 - 3223
[5] A performance benchmarking analysis of Hypervisors Containers and Unikernels on ARMv8 and x86 CPUs
Acharya, Ashijeet
Fanguede, Jeremy
Paolino, Michele
Raho, Daniel
2018 EUROPEAN CONFERENCE ON NETWORKS AND COMMUNICATIONS (EUCNC), 2018, : 282 - 287
[6] High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs
Henry, Glenn
Palangpour, Parviz
Thomson, Michael
Gardner, J. Scott
Arden, Bryce
Donahue, Jim
Houck, Kimble
Johnson, Jonathan
O'Brien, Kyle
Petersen, Scott
Seroussi, Benjamin
Walker, Tyler
2020 ACM/IEEE 47TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2020), 2020, : 15 - 26
[7] Helium: Lifting High-Performance Stencil Kernels from Stripped x86 Binaries to Halide DSL Code
Mendis, Charith
Bosboom, Jeffrey
Wu, Kevin
Kamil, Shoaib
Ragan-Kelley, Jonathan
Paris, Sylvain
Zhao, Qin
Amarasinghe, Saman
ACM SIGPLAN NOTICES, 2015, 50 (06) : 391 - 402
[8] 6x86: The Cyrix solution to executing x86 binaries on a high performance microprocessor
Mcmahan, SC
Bluhm, M
Garibay, RA
PROCEEDINGS OF THE IEEE, 1995, 83 (12) : 1664 - 1672
[9] High Performance Dense Linear Algebra on a Spatially Distributed Processor
Diamond, Jeff
Robatmili, Behnam
Keckler, Stephen W.
van de Geijn, Robert
Goto, Kazushige
Burger, Doug
PPOPP'08: PROCEEDINGS OF THE 2008 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2008, : 63 - 72
[10] Zen: An Energy-Efficient High-Performance x86 Core
Singh, Teja
Schaefer, Alex
Rangarajan, Sundar
John, Deepesh
Henrion, Carson
Schreiber, Russell
Rodriguez, Miguel
Kosonocky, Stephen
Naffziger, Samuel
Novak, Amy
IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2018, 53 (01) : 102 - 114