swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor

Cited by: 1
Authors
Gao, Wei [1 ]
Fang, Jiarui [1 ]
Zhao, Wenlai [1 ]
Yang, Jinzhe [2 ]
Wang, Long [3 ]
Gan, Lin [4 ]
Fu, Haohuan [4 ]
Yang, Guangwen [4 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Imperial Coll London, London, England
[3] Baidu, Syst Dept, Beijing, Peoples R China
[4] Tsinghua Univ, Natl Supercomp Ctr, Wuxi, Jiangsu, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Autotuning; Deep Learning Operators; SW26010
DOI
10.1145/3337821.3337883
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Achieving an optimized mapping of Deep Learning (DL) operators to new hardware architectures is the key to building a scalable DL system. However, handcrafted optimization involves enormous engineering effort, due to the variety of DL operator implementations and the complex programming skills they require. Targeting the innovative many-core processor SW26010 adopted by the third-fastest supercomputer, Sunway TaihuLight, an end-to-end automated framework called swATOP is presented as a more practical solution for DL operator optimization. Arithmetic-intensive DL operators are expressed in an auto-tuning-friendly form based on tensorized primitives. By describing the algorithm of a DL operator in our domain-specific language (DSL), swATOP is able to derive and produce an optimal implementation by separating hardware-dependent from hardware-agnostic optimization. Hardware-dependent optimization is encapsulated in a set of tensorized primitives that make sufficient use of the underlying hardware features. Hardware-agnostic optimization comprises a scheduler, an intermediate-representation (IR) optimizer, an auto-tuner, and a code generator. These modules cooperate to perform an automatic design-space exploration, apply a set of programming techniques, discover a near-optimal solution, and generate the executable code. Our experiments show that swATOP brings significant performance improvement on DL operators in over 88% of cases, compared with the best handcrafted optimization. Compared with a black-box auto-tuner, tuning and code-generation time is reduced from days to minutes with swATOP.
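As a reading aid only: this record does not reproduce swATOP's DSL or API, so the following is a minimal Python sketch of the kind of tuning loop the abstract describes, where hardware-dependent work hides behind a kernel callback (standing in for the tensorized primitives) and the tuner searches a schedule space. All names here (candidate_schedules, measure, autotune, dummy_kernel) are hypothetical illustrations, not the actual swATOP interface.

```python
# Hypothetical sketch: every name below is illustrative, not the swATOP API.
import itertools
import time

def candidate_schedules(M, N, K, tiles=(32, 64, 128)):
    """Enumerate a toy design space of GEMM tiling factors.

    In a swATOP-style flow, a scheduler would emit such candidates and an
    IR optimizer would lower each one onto tensorized primitives.
    """
    for tm, tn, tk in itertools.product(tiles, repeat=3):
        if tm <= M and tn <= N and tk <= K:
            yield {"tile_m": tm, "tile_n": tn, "tile_k": tk}

def measure(run_kernel, schedule):
    """Time one generated variant; all hardware-dependent work stays behind
    run_kernel, mirroring the hardware-dependent/agnostic split."""
    t0 = time.perf_counter()
    run_kernel(**schedule)
    return time.perf_counter() - t0

def autotune(M, N, K, run_kernel):
    """Pick the fastest schedule by direct measurement (exhaustive search)."""
    return min(candidate_schedules(M, N, K),
               key=lambda s: measure(run_kernel, s))

if __name__ == "__main__":
    # Stand-in kernel: a real flow would run generated code on the
    # SW26010's compute processing elements instead of this placeholder.
    def dummy_kernel(tile_m, tile_n, tile_k):
        _ = [[0.0] * tile_n for _ in range(tile_m)]

    print(autotune(512, 512, 512, dummy_kernel))
```

A production tuner would prune the space with cost models or learned predictors rather than exhaustively measuring every candidate, which is how a framework like swATOP can cut tuning time from days to minutes relative to a black-box auto-tuner.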
Pages: 10
Related Papers
50 records in total
  • [41] Federated Learning Platform on Embedded Many-core Processor with Flower
    Hasumi, Masahiro
    Azumi, Takuya
    2024 IEEE 3RD REAL-TIME AND INTELLIGENT EDGE COMPUTING WORKSHOP, RAGE 2024, 2024, : 37 - 42
  • [42] Optimizing massively parallel sparse matrix computing on ARM many-core processor
    Zheng, Jiang
    Jiang, Jiazhi
    Du, Jiangsu
    Huang, Dan
    Lu, Yutong
    PARALLEL COMPUTING, 2023, 117
  • [43] Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor
    Byun, Chansup
    Kepner, Jeremy
    Arcand, William
    Bestor, David
    Bergeron, Bill
    Gadepally, Vijay
    Houle, Michael
    Hubbell, Matthew
    Jones, Michael
    Klein, Anna
    Michaleas, Peter
    Milechin, Lauren
    Mullen, Julie
    Prout, Andrew
    Rosa, Antonio
    Samsi, Siddharth
    Yee, Charles
    Reuther, Albert
    2017 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2017
  • [44] A Many-Core Accelerator Design for On-Chip Deep Reinforcement Learning
    Wang, Ying
    Wang, Mengdi
    Li, Bing
    Li, Huawei
    Li, Xiaowei
    2020 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN (ICCAD), 2020
  • [45] PFSI.sw: A Programming Framework for Sea Ice Model Algorithms Based on Sunway Many-core Processor
    Li, Binyang
    Li, Bo
    Qian, Depei
    2017 IEEE 28TH INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS (ASAP), 2017, : 119 - 126
  • [46] xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor (Oct, 10.1007/s42514-022-00126-8, 2022)
    Liu, Fangfang
    Ma, Wenjing
    Zhao, Yuwen
    Chen, Daokun
    Hu, Yi
    Lu, Qinglin
    Yin, WanWang
    Yuan, Xinhui
    Jiang, Lijuan
    Yan, Hao
    Li, Min
    Wang, Hongsen
    Wang, Xinyu
    Yang, Chao
    CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2023, 5 (01) : 97 - 97
  • [47] ParaX: Boosting Deep Learning for Big Data Analytics on Many-Core CPUs
    Yin, Lujia
    Zhang, Yiming
    Zhang, Zhaoning
    Peng, Yuxing
    Zhao, Peng
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2021, 14 (06) : 864 - 877
  • [48] Optimizing Machine Learning Algorithms on Multi-core and Many-core Architectures using Thread and Data Mapping
    Serpa, Matheus S.
    Krause, Arthur M.
    Cruz, Eduardo H. M.
    Navaux, Philippe O. A.
    Pasin, Marcelo
    Felber, Pascal
    2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), 2018, : 329 - 333
  • [49] Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure
    Jiang, Jiazhi
    Huang, Zijian
    Huang, Dan
    Du, Jiangsu
    Chen, Lin
    Chen, Ziguan
    Lu, Yutong
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2023, 20 (03)
  • [50] Research on Parallel Acceleration for Deep Learning Inference Based on Many-Core ARM Platform
    Zhu, Keqian
    Jiang, Jingfei
    ADVANCED COMPUTER ARCHITECTURE, 2018, 908 : 30 - 41