swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor

Cited by: 1
Authors
Gao, Wei [1 ]
Fang, Jiarui [1 ]
Zhao, Wenlai [1 ]
Yang, Jinzhe [2 ]
Wang, Long [3 ]
Gan, Lin [4 ]
Fu, Haohuan [4 ]
Yang, Guangwen [4 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Imperial Coll London, London, England
[3] Baidu, Syst Dept, Beijing, Peoples R China
[4] Tsinghua Univ, Natl Supercomp Ctr, Wuxi, Jiangsu, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Autotuning; Deep Learning Operators; SW26010
DOI
10.1145/3337821.3337883
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Achieving an optimized mapping of Deep Learning (DL) operators to new hardware architectures is the key to building a scalable DL system. However, handcrafted optimization involves enormous engineering effort, due to the variety of DL operator implementations and the complex programming skills they require. Targeting the innovative many-core processor SW26010 adopted by the third-fastest supercomputer, Sunway TaihuLight, an end-to-end automated framework called swATOP is presented as a more practical solution for DL operator optimization. Arithmetic-intensive DL operators are expressed in an auto-tuning-friendly form based on tensorized primitives. By describing the algorithm of a DL operator in our domain-specific language (DSL), swATOP is able to derive and produce an optimal implementation by separating hardware-dependent from hardware-agnostic optimization. Hardware-dependent optimization is encapsulated in a set of tensorized primitives that make sufficient use of the underlying hardware features. Hardware-agnostic optimization comprises a scheduler, an intermediate-representation (IR) optimizer, an auto-tuner, and a code generator. These modules cooperate to perform an automatic design-space exploration, apply a set of programming techniques, discover a near-optimal solution, and generate the executable code. Our experiments show that swATOP brings significant performance improvement on DL operators in over 88% of cases, compared with the best handcrafted optimization. Compared with a black-box auto-tuner, tuning and code-generation time is reduced from days to minutes with swATOP.
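As a reading aid only: this record does not reproduce swATOP's DSL or API, so the following is a minimal Python sketch of the kind of tuning loop the abstract describes, where hardware-dependent work hides behind a kernel callback (standing in for the tensorized primitives) and the tuner searches a schedule space. All names here (candidate_schedules, measure, autotune, dummy_kernel) are hypothetical illustrations, not the actual swATOP interface.

```python
# Hypothetical sketch: every name below is illustrative, not the swATOP API.
import itertools
import time

def candidate_schedules(M, N, K, tiles=(32, 64, 128)):
    """Enumerate a toy design space of GEMM tiling factors.

    In a swATOP-style flow, a scheduler would emit such candidates and an
    IR optimizer would lower each one onto tensorized primitives.
    """
    for tm, tn, tk in itertools.product(tiles, repeat=3):
        if tm <= M and tn <= N and tk <= K:
            yield {"tile_m": tm, "tile_n": tn, "tile_k": tk}

def measure(run_kernel, schedule):
    """Time one generated variant; all hardware-dependent work stays behind
    run_kernel, mirroring the hardware-dependent/agnostic split."""
    t0 = time.perf_counter()
    run_kernel(**schedule)
    return time.perf_counter() - t0

def autotune(M, N, K, run_kernel):
    """Pick the fastest schedule by direct measurement (exhaustive search)."""
    return min(candidate_schedules(M, N, K),
               key=lambda s: measure(run_kernel, s))

if __name__ == "__main__":
    # Stand-in kernel: a real flow would run generated code on the
    # SW26010's compute processing elements instead of this placeholder.
    def dummy_kernel(tile_m, tile_n, tile_k):
        _ = [[0.0] * tile_n for _ in range(tile_m)]

    print(autotune(512, 512, 512, dummy_kernel))
```

A production tuner would prune the space with cost models or learned predictors rather than exhaustively measuring every candidate, which is how a framework like swATOP can cut tuning time from days to minutes relative to a black-box auto-tuner.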
Pages: 10
Related Papers
50 records in total
  • [41] Federated Learning Platform on Embedded Many-core Processor with Flower
    Hasumi, Masahiro
    Azumi, Takuya
    2024 IEEE 3RD REAL-TIME AND INTELLIGENT EDGE COMPUTING WORKSHOP, RAGE 2024, 2024, : 37 - 42
  • [42] Optimizing massively parallel sparse matrix computing on ARM many-core processor
    Zheng, Jiang
    Jiang, Jiazhi
    Du, Jiangsu
    Huang, Dan
    Lu, Yutong
    PARALLEL COMPUTING, 2023, 117
  • [43] Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor
    Byun, Chansup
    Kepner, Jeremy
    Arcand, William
    Bestor, David
    Bergeron, Bill
    Gadepally, Vijay
    Houle, Michael
    Hubbell, Matthew
    Jones, Michael
    Klein, Anna
    Michaleas, Peter
    Milechin, Lauren
    Mullen, Julie
    Prout, Andrew
    Rosa, Antonio
    Samsi, Siddharth
    Yee, Charles
    Reuther, Albert
    2017 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2017
  • [44] A Many-Core Accelerator Design for On-Chip Deep Reinforcement Learning
    Wang, Ying
    Wang, Mengdi
    Li, Bing
    Li, Huawei
    Li, Xiaowei
    2020 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN (ICCAD), 2020
  • [45] PFSI.sw: A Programming Framework for Sea Ice Model Algorithms Based on Sunway Many-core Processor
    Li, Binyang
    Li, Bo
    Qian, Depei
    2017 IEEE 28TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP), 2017, : 119 - 126
  • [46] xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor (Oct, 10.1007/s42514-022-00126-8, 2022)
    Liu, Fangfang
    Ma, Wenjing
    Zhao, Yuwen
    Chen, Daokun
    Hu, Yi
    Lu, Qinglin
    Yin, WanWang
    Yuan, Xinhui
    Jiang, Lijuan
    Yan, Hao
    Li, Min
    Wang, Hongsen
    Wang, Xinyu
    Yang, Chao
    CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2023, 5 (01) : 97 - 97
  • [47] ParaX: Boosting Deep Learning for Big Data Analytics on Many-Core CPUs
    Yin, Lujia
    Zhang, Yiming
    Zhang, Zhaoning
    Peng, Yuxing
    Zhao, Peng
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2021, 14 (06) : 864 - 877
  • [48] Optimizing Machine Learning Algorithms on Multi-core and Many-core Architectures using Thread and Data Mapping
    Serpa, Matheus S.
    Krause, Arthur M.
    Cruz, Eduardo H. M.
    Navaux, Philippe O. A.
    Pasin, Marcelo
    Felber, Pascal
    2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), 2018, : 329 - 333
  • [49] Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure
    Jiang, Jiazhi
    Huang, Zijian
    Huang, Dan
    Du, Jiangsu
    Chen, Lin
    Chen, Ziguan
    Lu, Yutong
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2023, 20 (03)
  • [50] Research on Parallel Acceleration for Deep Learning Inference Based on Many-Core ARM Platform
    Zhu, Keqian
    Jiang, Jingfei
    ADVANCED COMPUTER ARCHITECTURE, 2018, 908 : 30 - 41