AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs

Cited by: 82
Authors
Wang, Qian [1]
Zhang, Xianyi [1, 2]
Zhang, Yunquan [3]
Yi, Qing [4]
Affiliations
[1] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Software, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Software, State Key Lab Comp Architecture, Beijing, Peoples R China
[4] Univ Colorado, Boulder, CO 80309 USA
Keywords
DLA code optimization; code generation; auto-tuning
DOI
10.1145/2503210.2503219
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on varying multi-core CPUs without requiring any manual intervention from developers. In particular, based on domain-specific knowledge about the algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and thereby translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach surpass the implementations of the Intel MKL and AMD ACML BLAS libraries on both Intel Sandy Bridge and AMD Piledriver processors.
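The template-matching idea in the abstract can be illustrated with a toy sketch. It is not AUGEM's optimizer: the statement format, the `MUL_ADD` template name, the functions `match_mul_add` and `emit`, and the pseudo-assembly output are all assumptions made for illustration. It only shows the general shape of the approach, which is to scan low-level three-address statements for a known multiply-then-accumulate sequence and replace each match with a vector-style instruction pair (`vmulpd` and `vaddpd` are real AVX mnemonics, but the operand syntax here is simplified).

```python
import re

# Hypothetical template: a scalar multiply followed by an accumulate, e.g.
#   t0 = A[i] * B[j]
#   c0 += t0
# The patterns and names below are illustrative, not taken from the paper.
MUL = re.compile(r"^(\w+) = (\w+\[\w+\]) \* (\w+\[\w+\])$")
ADD = re.compile(r"^(\w+) \+= (\w+)$")


def match_mul_add(stmts):
    """Tag each 'tmp = x * y' / 'acc += tmp' pair as one MUL_ADD template."""
    out = []
    i = 0
    while i < len(stmts):
        m = MUL.match(stmts[i])
        a = ADD.match(stmts[i + 1]) if m and i + 1 < len(stmts) else None
        # The accumulate must consume the temporary produced by the multiply.
        if m and a and a.group(2) == m.group(1):
            out.append(("MUL_ADD", a.group(1), m.group(2), m.group(3)))
            i += 2
        else:
            out.append(("OTHER", stmts[i]))
            i += 1
    return out


def emit(matched):
    """Render matched templates as simplified AVX-style pseudo-assembly."""
    lines = []
    for item in matched:
        if item[0] == "MUL_ADD":
            _, acc, x, y = item
            lines.append(f"vmulpd tmp, {x}, {y}")
            lines.append(f"vaddpd {acc}, {acc}, tmp")
        else:
            lines.append(f"; unmatched: {item[1]}")
    return lines


if __name__ == "__main__":
    body = ["t0 = A[i] * B[j]", "c0 += t0", "i = i + 1"]
    for line in emit(match_mul_add(body)):
        print(line)
```

A real generator would also have to handle register allocation, vector widths, and instruction scheduling, none of which this sketch attempts.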
Pages: 12