PockEngine: Sparse and Efficient Fine-tuning in a Pocket

Cited by: 0
Authors
Zhu, Ligeng [1 ]
Hu, Lanxiang [2 ]
Lin, Ji [1 ]
Wang, Wei-Chen [1 ]
Chen, Wei-Ming [1 ]
Gan, Chuang [3 ]
Han, Song [1 ]
Affiliations
[1] MIT, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[2] Univ Calif San Diego, San Diego, CA USA
[3] MIT IBM Watson Lab, Cambridge, MA USA
Keywords
neural network; sparse update; on-device training; efficient finetuning
DOI
10.1145/3613424.3614307
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack optimizations for learning on the edge, which faces the challenges of limited resources and diverse edge hardware. We introduce PockEngine: a tiny, sparse, and efficient engine that enables fine-tuning on a variety of edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, delivering measured memory savings and latency reductions while maintaining model quality. Second, PockEngine is compilation-first: the entire training graph (including the forward, backward, and optimization steps) is derived at compile time, which reduces runtime overhead and opens opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, including operator reordering and backend switching, to further reduce training cost. PockEngine supports diverse applications, frontends, and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPUs/GPUs/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to a 15x speedup over off-the-shelf TensorFlow (Raspberry Pi) and 5.6x memory savings during backpropagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on an NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9x faster than PyTorch.
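To make the sparse-update idea in the abstract concrete, the sketch below shows one way to fine-tune only a small subset of parameters in plain PyTorch by freezing most weights so that autograd skips gradients for them and the optimizer tracks state only for the trainable subset. This is an illustration of the general technique, not PockEngine's compiler or runtime API; the model shape and the choice of trainable parameters (biases plus the last layer) are arbitrary assumptions for the example.

```python
# Hedged sketch of sparse updates: freeze most parameters so the backward
# pass skips their gradients; only a small subset is actually updated.
# Illustrative only; PockEngine performs this pruning at compile time.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),  # index 4: hypothetical classifier head
)

# Freeze everything, then re-enable a sparse subset (biases + last layer).
for p in model.parameters():
    p.requires_grad_(False)
for name, p in model.named_parameters():
    if name.endswith("bias") or name.startswith("4."):
        p.requires_grad_(True)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()        # gradients are not computed for frozen weights
optimizer.step()
optimizer.zero_grad()
```

Because optimizer state and weight gradients exist only for the trainable subset, memory usage drops relative to full fine-tuning; the paper's reported savings additionally rely on compile-time graph pruning and other optimizations not shown here.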
Pages: 1381 - 1394
Page count: 14
Related Papers
50 records in total
  • [1] Efficient Fine-Tuning of BERT Models on the Edge
    Vucetic, Danilo
    Tayaranian, Mohammadreza
    Ziaeefard, Maryam
    Clark, James J.
    Meyer, Brett H.
    Gross, Warren J.
    [J]. 2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022: 1838 - 1842
  • [2] Fine-tuning
    Anonymous
    [J]. AVIATION WEEK & SPACE TECHNOLOGY, 2001, 155 (02): 21 - 21
  • [3] Fine-Tuning
    Manson, Neil A.
    [J]. TPM-THE PHILOSOPHERS MAGAZINE, 2019, (86): 99 - 105
  • [4] Fine-tuning
    Smallridge, Rachel
    [J]. Nature Reviews Molecular Cell Biology, 2004, 5 (2): 79 - 79
  • [5] Fine-tuning
    Anonymous
    [J]. MECHANICAL ENGINEERING, 2007, 129 (03): 23 - 23
  • [6] Composable Sparse Fine-Tuning for Cross-Lingual Transfer
    Ansell, Alan
    Ponti, Edoardo Maria
    Korhonen, Anna
    Vulic, Ivan
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022: 1778 - 1796
  • [7] How fine can fine-tuning be? Learning efficient language models
    Radiya-Dixit, Evani
    Wang, Xin
    [J]. INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108: 2435 - 2442
  • [8] FINE-TUNING FINE CHEMICALS
    ROYSE, S
    [J]. EUROPEAN CHEMICAL NEWS, 1995, 64 (1693): 28 - &
  • [9] FINE-TUNING THE FOIA
    KENNEDY, P
    [J]. COLUMBIA JOURNALISM REVIEW, 1984, 23 (03): 8 - 9
  • [10] Fine-tuning subtitle
    [J]. Waste Age, 2002, 33 (11): 32 - 46