Simba: Scaling Deep-Learning Inference with Chiplet-Based Architecture

Cited by: 6
Authors
Shao, Yakun Sophia [1 ,2 ]
Clemons, Jason [2]
Venkatesan, Rangharajan [3 ]
Zimmer, Brian [3 ]
Fojtik, Matthew [4 ]
Jiang, Nan [5 ]
Keller, Ben [3 ]
Klinefelter, Alicia [4 ]
Pinckney, Nathaniel [2 ]
Raina, Priyanka [6 ]
Tell, Stephen G. [4 ]
Zhang, Yanqing [3 ]
Dally, William J. [6 ,7 ]
Emer, Joel [5 ,8 ]
Gray, C. Thomas [4 ]
Khailany, Brucek [2 ]
Keckler, Stephen W. [2 ]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] NVIDIA, Austin, TX USA
[3] NVIDIA, Santa Clara, CA USA
[4] NVIDIA, Durham, NC USA
[5] NVIDIA, Westford, MA USA
[6] Stanford Univ, Stanford, CA 94305 USA
[7] NVIDIA, Incline Village, NV USA
[8] MIT, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
DOI
10.1145/3460227
CLC number
TP3 [computing technology, computer technology];
Subject classification code
0812;
Abstract
Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
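As a quick sanity check on the figures quoted in the abstract (not part of the paper itself), the short Python sketch below relates the per-chiplet and package peak throughput and derives the batch-size-one latency from the reported ResNet-50 throughput. All constants are taken directly from the numbers stated above; the computation is purely illustrative arithmetic, not the authors' evaluation methodology.

    # Illustrative check of the figures quoted in the abstract.
    # Constants are the numbers stated in the abstract, nothing more.
    NUM_CHIPLETS = 36
    TOPS_PER_CHIPLET = 4.0          # peak per-chiplet throughput (TOPS)
    PACKAGE_PEAK_TOPS = 128.0       # peak quoted for the full MCM package
    RESNET50_IMAGES_PER_SEC = 1988  # measured ResNet-50 throughput at batch size 1

    # Naive sum of per-chiplet peaks vs. the quoted package peak
    # (the package figure quoted in the abstract is lower than the naive sum).
    aggregate_peak = NUM_CHIPLETS * TOPS_PER_CHIPLET
    print(f"Sum of per-chiplet peaks: {aggregate_peak:.0f} TOPS "
          f"(quoted package peak: {PACKAGE_PEAK_TOPS:.0f} TOPS)")

    # With batch size one, latency is the reciprocal of throughput.
    latency_ms = 1000.0 / RESNET50_IMAGES_PER_SEC
    print(f"ResNet-50 latency at batch size 1: {latency_ms:.2f} ms")  # ~0.50 ms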
Pages: 107-116
Page count: 10
Related papers
50 records in total
  • [1] Scaling Deep-Learning Inference with Chiplet-based Architecture and Photonic Interconnects
    Li, Yuan
    Louri, Ahmed
    Karanth, Avinash
    [J]. 2021 58TH ACM/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2021, : 931 - 936
  • [2] Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
    Shao, Yakun Sophia
    Clemons, Jason
    Venkatesan, Rangharajan
    Zimmer, Brian
    Fojtik, Matthew
    Jiang, Nan
    Keller, Ben
    Klinefelter, Alicia
    Pinckney, Nathaniel
    Raina, Priyanka
    Tell, Stephen G.
    Zhang, Yanqing
    Dally, William J.
    Emer, Joel
    Gray, C. Thomas
    Khailany, Brucek
    Keckler, Stephen W.
    [J]. MICRO'52: THE 52ND ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 2019, : 14 - 27
  • [3] Review of chiplet-based design: system architecture and interconnection
    Liu, Yafei
    Li, Xiangyu
    Yin, Shouyi
    [J]. SCIENCE CHINA-INFORMATION SCIENCES, 2024, 67 (10)
  • [4] Review of chiplet-based design: system architecture and interconnection
    Liu, Yafei
    Li, Xiangyu
    Yin, Shouyi
    [J]. Science China (Information Sciences), 2024, 67 (10) : 5 - 24
  • [5] Deep Reinforcement Learning-Based Power Management for Chiplet-Based Multicore Systems
    Li, Xiao
    Chen, Lin
    Chen, Shixi
    Jiang, Fan
    Li, Chengeng
    Zhang, Wei
    Xu, Jiang
    [J]. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2024, 32 (09) : 1726 - 1739
  • [6] Power Management for Chiplet-Based Multicore Systems Using Deep Reinforcement Learning
    Li, Xiao
    Chen, Lin
    Chen, Shixi
    Jiang, Fan
    Li, Chengeng
    Xu, Jiang
    [J]. 2022 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2022), 2022, : 164 - 169
  • [7] Chiplet-GAN: Chiplet-Based Accelerator Design for Scalable Generative Adversarial Network Inference
    Chen, Yuechen
    Louri, Ahmed
    Lombardi, Fabrizio
    Liu, Shanshan
    [J]. IEEE CIRCUITS AND SYSTEMS MAGAZINE, 2024, 24 (03) : 19 - 33
  • [8] ChipletNP: Chiplet-Based Agile Customizable Network Processor Architecture
    Li, Tao
    Yang, Hui
    Li, Junnan
    Liu, Rulin
    Sun, Zhigang
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2024, 61 (12): : 2952 - 2968
  • [9] A Chiplet Prototype System for Deep Learning Inference
    Jerger, Natalie Enright
    [J]. COMMUNICATIONS OF THE ACM, 2021, 64 (06) : 106 - 106