swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor

Cited by: 1
Authors
Gao, Wei [1]
Fang, Jiarui [1]
Zhao, Wenlai [1]
Yang, Jinzhe [2]
Wang, Long [3]
Gan, Lin [4]
Fu, Haohuan [4]
Yang, Guangwen [4]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Imperial Coll London, London, England
[3] Syst Dept Baidu, Beijing, Peoples R China
[4] Tsinghua Univ, Natl Supercomp Ctr, Wuxi, Jiangsu, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Autotuning; Deep Learning Operators; SW26010;
DOI
10.1145/3337821.3337883
Chinese Library Classification (CLC) number
TP3 [Computing Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Achieving an optimized mapping of Deep Learning (DL) operators to new hardware architectures is the key to building a scalable DL system. However, handcrafted optimization demands a huge engineering effort because of the variety of DL operator implementations and the complex programming skills required. This paper presents swATOP, an end-to-end automated framework that offers a more practical solution for DL operator optimization on SW26010, the innovative many-core processor adopted by the third-fastest supercomputer, Sunway TaihuLight. Arithmetic-intensive DL operators are expressed in an auto-tuning-friendly form based on tensorized primitives. Given the algorithm of a DL operator described in our domain-specific language (DSL), swATOP derives and produces an optimal implementation by separating hardware-dependent optimization from hardware-agnostic optimization. Hardware-dependent optimization is encapsulated in a set of tensorized primitives that sufficiently utilize the underlying hardware features. Hardware-agnostic optimization comprises a scheduler, an intermediate representation (IR) optimizer, an auto-tuner, and a code generator. These modules cooperate to explore the design space automatically, apply a set of programming techniques, discover a near-optimal solution, and generate the executable code. Our experiments show that swATOP brings significant performance improvement on DL operators in over 88% of cases compared with the best handcrafted optimization. Compared with a black-box autotuner, swATOP reduces the tuning and code generation time from days to minutes.
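The abstract describes an auto-tuner that explores a design space and selects a near-optimal schedule. The following is a minimal sketch of that general idea only, not swATOP's actual implementation: it enumerates tile sizes for a tiled matrix multiplication, discards candidates that do not fit an on-chip scratchpad, and ranks the rest with a toy memory-traffic model. The scratchpad capacity, element size, candidate set, cost model, and all function names are assumptions made for illustration.

```python
from itertools import product

# Assumptions for illustration only (not taken from the abstract):
LDM_BYTES = 64 * 1024   # assumed per-core scratchpad capacity
ELEM_BYTES = 4          # assumed single-precision elements


def footprint(tm, tn, tk):
    """Bytes needed to hold one A tile, one B tile, and one C tile."""
    return (tm * tk + tk * tn + tm * tn) * ELEM_BYTES


def traffic(m, n, k, tm, tn, tk):
    """Toy cost model: elements moved between main memory and scratchpad."""
    # Ceiling division gives the number of tiles along each dimension.
    tiles_m, tiles_n, tiles_k = -(-m // tm), -(-n // tn), -(-k // tk)
    # A and B tiles are loaded once per (i, j, p) tile triple.
    loads_ab = tiles_m * tiles_n * tiles_k * (tm * tk + tk * tn)
    # Each C tile is loaded once and stored once per (i, j) tile pair.
    moves_c = tiles_m * tiles_n * 2 * tm * tn
    return loads_ab + moves_c


def tune(m, n, k, candidates=(8, 16, 32, 64, 128)):
    """Return (cost, (tm, tn, tk)) for the cheapest tiling that fits."""
    best = None
    for tm, tn, tk in product(candidates, repeat=3):
        if footprint(tm, tn, tk) > LDM_BYTES:
            continue  # prune schedules that overflow the scratchpad
        cost = traffic(m, n, k, tm, tn, tk)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best


if __name__ == "__main__":
    cost, tiles = tune(256, 256, 256)
    print("selected tile sizes:", tiles, "estimated traffic:", cost)
```

Ranking candidates with an analytical model instead of running each one is one plausible way tuning time could drop from days to minutes, although the abstract does not say which mechanism swATOP uses.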
Pages: 10
Related papers
50 in total
  • [1] Benchmarking SW26010 Many-core Processor
    Xu, Zhigeng
    Lin, James
    Matsuoka, Satoshi
    2017 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2017, : 743 - 752
  • [2] UNAT: UNstructured Acceleration Toolkit on SW26010 many-core processor
    Liu, Hongbin
    Ren, Hu
    Gu, Hanfeng
    Gao, Fei
    Yang, Guangwen
    ENGINEERING COMPUTATIONS, 2020, 37 (09) : 3187 - 3208
  • [3] Runtime Adaptive Matrix Multiplication for the SW26010 Many-Core Processor
    Wu, Zheng
    Li, Mingfan
    Chi, Mengxian
    Xu, Le
    An, Hong
    IEEE ACCESS, 2020, 8 : 156915 - 156928
  • [4] Towards Highly Efficient DGEMM on the Emerging SW26010 Many-core Processor
    Jiang, Lijuan
    Yang, Chao
    Ao, Yulong
    Yin, Wanwang
    Ma, Wenjing
    Sun, Qiao
    Liu, Fangfang
    Lin, Rongfen
    Zhang, Peng
    2017 46TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP), 2017, : 422 - 431
  • [5] SW-LZMA: Parallel Implementation of LZMA Based on SW26010 Many-Core Processor
    Li, Bingzheng
    Xu, Jinchen
    Liu, Zijing
    WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2021, 2021
  • [6] Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor
    Jiang, Lijuan
    Yang, Chao
    Ma, Wenjing
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2020, 17 (01)
  • [7] A PARALLEL APPROACH FOR OIL PALM TREE DETECTION ON A SW26010 MANY-CORE PROCESSOR
    Zheng, Juepeng
    Wu, Wenzhao
    Zhao, Yi
    Yuan, Shuai
    Dong, Runmin
    Zhang, Lixian
    Fu, Haohuan
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 1548 - 1551
  • [8] Efficient Implementation of Multilevel Fast Multipole Algorithm on SW26010 Many-core Processor
    He, Wei-Jia
    Yang, Ming-Lin
    Sheng, Xin-Qing
    2020 IEEE MTT-S INTERNATIONAL CONFERENCE ON NUMERICAL ELECTROMAGNETIC AND MULTIPHYSICS MODELING AND OPTIMIZATION (NEMO 2020), 2020,
  • [9] Parallel SHA-256 on SW26010 many-core processor for hashing of multiple messages
    Wang, Ziheng
    Dong, Xiaoshe
    Kang, Yan
    Chen, Heng
    The Journal of Supercomputing, 2023, 79 : 2332 - 2355
  • [10] Implementation of Hybrid Alignment Algorithm for Protein Database Search on the SW26010 Many-Core Processor
    Zhang, Hao
    Fu, You
    Feng, Lu-Bin
    Zhang, Yue
    Hua, Rong
    IEEE ACCESS, 2019, 7 : 128054 - 128063