Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models

Cited by: 1
Authors
Li, Wenjie [1 ]
Hu, Aokun [1 ]
Xu, Ningyi [1 ]
He, Guanghui [2 ,3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Dept Micro Nano Elect, Shanghai 200240, Peoples R China
[3] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Large language models; quantization; hardware architecture; precision-scalable; outlier;
DOI
10.1109/TCSI.2024.3350661
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline codes
0808; 0809;
Abstract
Large language models (LLMs) have sparked a new revolution in the field of natural language processing (NLP) and have garnered tremendous attention in both academic research and everyday life, thanks to their unprecedented performance in a wide range of applications. However, their deployment remains a significant challenge, primarily due to their intensive computational and memory requirements. Hardware acceleration and efficient quantization are promising solutions to these two issues. In this paper, a quantization and hardware architecture co-design is presented for the matrix-vector multiplications (MVMs) of LLMs. During quantization, we uniformly group weights and activations to ensure workload balance for the hardware. To enhance quantization performance, we further propose two approaches, channel sorting and channel selection, which can be applied simultaneously. To support the proposed quantization scheme, we develop two precision-scalable MVM hardware architectures, designed for high speed and high energy efficiency, respectively. Experimental results show that our proposed quantization scheme achieves state-of-the-art performance among all reported post-training schemes that quantize both weights and activations into integers. Compared to the MVM architecture of the state-of-the-art LLM accelerator OliVe, our design exhibits significant advantages in terms of area efficiency and energy efficiency.
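The paper's exact quantization scheme (including channel sorting and channel selection) is not detailed in this record; as a minimal sketch, the uniform grouping idea described in the abstract can be illustrated with generic symmetric per-group integer quantization, where every contiguous group of values shares one scale so all groups present the same integer workload to the hardware. All names and parameters below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_groupwise(x, group_size=128, n_bits=4):
    """Symmetric per-group integer quantization of a 1-D array.

    Each contiguous group of `group_size` values shares one scale,
    so every group maps to the same integer budget -- the kind of
    uniform grouping that keeps hardware workloads balanced.
    """
    qmax = 2 ** (n_bits - 1) - 1
    x = np.asarray(x, dtype=np.float64)
    groups = x.reshape(-1, group_size)
    # One scale per group, derived from the group's absolute maximum.
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_groupwise(q, scales):
    """Reconstruct the float values from integers and per-group scales."""
    return (q.astype(np.float64) * scales).reshape(-1)

# Round-trip error is bounded by half a quantization step per group.
w = np.random.default_rng(0).normal(size=256)
q, s = quantize_groupwise(w, group_size=128, n_bits=4)
w_hat = dequantize_groupwise(q, s)
```

Per-group scales are the standard way to limit the damage from outlier values: an outlier inflates only its own group's scale instead of the whole tensor's, which is why grouping granularity matters for LLM weights and activations.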
Pages: 2858-2871
Number of pages: 14