Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

Authors
Yao, Jinghan [1]
Alnaasan, Nawras [1]
Chen, Tian [1]
Shafi, Aamir [1]
Subramoni, Hari [1]
Panda, Dhabaleswar K. [1]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Keywords
Autoregressive model; Inference frameworks; Parallel pipelining; Distributed inference
DOI
10.1109/HiPC58850.2023.00026
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models carries a temporal dependency by design: the current token's probability distribution is conditioned on the preceding tokens. This characteristic severely impedes computational efficiency during inference, because a typical request can require thousands of tokens and generating each token requires loading the entire model weights, which makes inference memory-bound. The overhead becomes more pronounced in real deployments, where requests arrive at random and require varying generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention, falling short of optimal latency and throughput. To address these shortcomings, we propose Flover, a temporal fusion framework for efficiently inferring multiple requests in parallel. We decompose the general generation pipeline into pre-processing and token generation, and equip the framework with a dedicated work scheduler that fuses the generation process temporally across all requests. By orchestrating token-level parallelism, Flover exhibits optimal hardware efficiency and significantly saves system resources. By further employing a fast buffer reordering algorithm that allows memory eviction of finished tasks, it brings over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging tensor parallelism, Flover is effective across diverse computational settings, from single-GPU setups to distributed scenarios, and thus offers robust performance optimization that adapts to variable use cases.
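The scheduling idea in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not Flover's implementation: the names `Request`, `toy_next_token`, `compact`, and `run_fused` are hypothetical, the "model" is a stand-in that emits a counter, and the compaction routine is a simple stable in-place pass rather than the paper's buffer reordering algorithm. It shows requests joining an in-flight batch at token granularity, one fused step producing one token for every active request, and finished requests being evicted so the buffer stays contiguous.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: str
    arrival_step: int  # decoding step at which the request arrives
    target_len: int    # tokens to generate (assumed >= 1)
    tokens: list = field(default_factory=list)


def toy_next_token(req):
    # Hypothetical stand-in for a model forward pass.
    return len(req.tokens)


def compact(active):
    """Evict finished requests in place, keeping survivors contiguous."""
    write = 0
    finished = []
    for req in list(active):
        if len(req.tokens) >= req.target_len:
            finished.append(req)
        else:
            active[write] = req
            write += 1
    del active[write:]
    return finished


def run_fused(requests):
    """Simulate token-level fusion; return finish steps and batch sizes."""
    pending = sorted(requests, key=lambda r: r.arrival_step)
    active, done, step_sizes = [], {}, []
    step = 0
    while pending or active:
        # Admit newly arrived requests without waiting for others to finish.
        while pending and pending[0].arrival_step <= step:
            active.append(pending.pop(0))
        if active:
            step_sizes.append(len(active))
            # One fused step: every active request advances by one token.
            for req in active:
                req.tokens.append(toy_next_token(req))
            for req in compact(active):
                done[req.rid] = step
        step += 1
    return done, step_sizes
```

For example, calling `run_fused` with three requests that arrive at steps 0, 1, and 2 lets the later arrivals share decoding steps with the first one instead of queuing behind it, which is the contrast with request-level batching that the abstract draws.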
Pages: 107-116
Page count: 10