Simba: Scaling Deep-Learning Inference with Chiplet-Based Architecture

Cited by: 6
Authors
Shao, Yakun Sophia [1 ,2 ]
Clemons, Jason [2]
Venkatesan, Rangharajan [3 ]
Zimmer, Brian [3 ]
Fojtik, Matthew [4 ]
Jiang, Nan [5 ]
Keller, Ben [3 ]
Klinefelter, Alicia [4 ]
Pinckney, Nathaniel [2 ]
Raina, Priyanka [6 ]
Tell, Stephen G. [4 ]
Zhang, Yanqing [3 ]
Dally, William J. [6 ,7 ]
Emer, Joel [5 ,8 ]
Gray, C. Thomas [4 ]
Khailany, Brucek [2 ]
Keckler, Stephen W. [2 ]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] NVIDIA, Austin, TX USA
[3] NVIDIA, Santa Clara, CA USA
[4] NVIDIA, Durham, NC USA
[5] NVIDIA, Westford, MA USA
[6] Stanford Univ, Stanford, CA 94305 USA
[7] NVIDIA, Incline Village, NV USA
[8] MIT, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
DOI
10.1145/3460227
CLC number
TP3 [computing technology, computer technology];
Subject classification code
0812;
Abstract
Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
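As a quick sanity check on the figures quoted in the abstract (not part of the paper itself), the short Python sketch below relates the per-chiplet and package peak throughput and derives the batch-size-one latency from the reported ResNet-50 throughput. All constants are taken directly from the numbers stated above; the computation is purely illustrative arithmetic, not the authors' evaluation methodology.

    # Illustrative check of the figures quoted in the abstract.
    # Constants are the numbers stated in the abstract, nothing more.
    NUM_CHIPLETS = 36
    TOPS_PER_CHIPLET = 4.0          # peak per-chiplet throughput (TOPS)
    PACKAGE_PEAK_TOPS = 128.0       # peak quoted for the full MCM package
    RESNET50_IMAGES_PER_SEC = 1988  # measured ResNet-50 throughput at batch size 1

    # Naive sum of per-chiplet peaks vs. the quoted package peak
    # (the package figure quoted in the abstract is lower than the naive sum).
    aggregate_peak = NUM_CHIPLETS * TOPS_PER_CHIPLET
    print(f"Sum of per-chiplet peaks: {aggregate_peak:.0f} TOPS "
          f"(quoted package peak: {PACKAGE_PEAK_TOPS:.0f} TOPS)")

    # With batch size one, latency is the reciprocal of throughput.
    latency_ms = 1000.0 / RESNET50_IMAGES_PER_SEC
    print(f"ResNet-50 latency at batch size 1: {latency_ms:.2f} ms")  # ~0.50 ms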
Pages: 107-116
Page count: 10
Related papers
50 records in total
  • [1] Scaling Deep-Learning Inference with Chiplet-based Architecture and Photonic Interconnects
    Li, Yuan
    Louri, Ahmed
    Karanth, Avinash
    [J]. 2021 58TH ACM/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2021, : 931 - 936
  • [2] Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
    Shao, Yakun Sophia
    Clemons, Jason
    Venkatesan, Rangharajan
    Zimmer, Brian
    Fojtik, Matthew
    Jiang, Nan
    Keller, Ben
    Klinefelter, Alicia
    Pinckney, Nathaniel
    Raina, Priyanka
    Tell, Stephen G.
    Zhang, Yanqing
    Dally, William J.
    Emer, Joel
    Gray, C. Thomas
    Khailany, Brucek
    Keckler, Stephen W.
    [J]. MICRO'52: THE 52ND ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 2019, : 14 - 27
  • [3] Review of chiplet-based design: system architecture and interconnection
    Liu, Yafei
    Li, Xiangyu
    Yin, Shouyi
    [J]. SCIENCE CHINA-INFORMATION SCIENCES, 2024, 67 (10)
  • [4] Review of chiplet-based design: system architecture and interconnection
    Liu, Yafei
    Li, Xiangyu
    Yin, Shouyi
    [J]. Science China (Information Sciences), 2024, 67 (10) : 5 - 24
  • [5] Deep Reinforcement Learning-Based Power Management for Chiplet-Based Multicore Systems
    Li, Xiao
    Chen, Lin
    Chen, Shixi
    Jiang, Fan
    Li, Chengeng
    Zhang, Wei
    Xu, Jiang
    [J]. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2024, 32 (09) : 1726 - 1739
  • [6] Power Management for Chiplet-Based Multicore Systems Using Deep Reinforcement Learning
    Li, Xiao
    Chen, Lin
    Chen, Shixi
    Jiang, Fan
    Li, Chengeng
    Xu, Jiang
    [J]. 2022 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2022), 2022, : 164 - 169
  • [7] Chiplet-GAN: Chiplet-Based Accelerator Design for Scalable Generative Adversarial Network Inference
    Chen, Yuechen
    Louri, Ahmed
    Lombardi, Fabrizio
    Liu, Shanshan
    [J]. IEEE CIRCUITS AND SYSTEMS MAGAZINE, 2024, 24 (03) : 19 - 33
  • [8] ChipletNP: Chiplet-Based Agile Customizable Network Processor Architecture
    Li, Tao
    Yang, Hui
    Li, Junnan
    Liu, Rulin
    Sun, Zhigang
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2024, 61 (12): : 2952 - 2968
  • [9] A Chiplet Prototype System for Deep Learning Inference
    Jerger, Natalie Enright
    [J]. COMMUNICATIONS OF THE ACM, 2021, 64 (06) : 106 - 106