Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Cited by: 0
Authors:
Gujarati, Arpan [1]; Karimi, Reza [2]; Alzayat, Safya [1]; Hao, Wei [1]; Kaufmann, Antoine [1]; Vigfusson, Ymir [2]; Mace, Jonathan [1]
Affiliations:
[1] Max Planck Inst Software Syst, Saarbrucken, Germany
[2] Emory Univ, Atlanta, GA 30322 USA
Source:
Funding:
U.S. National Science Foundation;
Keywords:
TAIL;
DOI:
None
Chinese Library Classification:
TP31 [Computer Software];
Subject Classification Codes:
081202; 0835;
Abstract:
Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable; on the contrary, we observe that inference using Deep Neural Network (DNN) models has deterministic performance. Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. We evaluate our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100 ms latency targets for 99.9999% of requests. We further demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.
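The abstract's central claim is that deterministic per-model inference times let a serving system reason about deadlines before executing work, rather than reacting to tail latency after the fact. The following is a minimal illustrative sketch of that idea only, not Clockwork's actual scheduler or API: the model names, profiled latencies, class structure, and admission rule are all assumptions made for the example, and real Clockwork additionally handles batching, model loading into GPU memory, and distribution across workers.

```python
# Illustrative sketch (not Clockwork's code): with known, deterministic
# per-model execution times, a scheduler can compute a request's finish time
# in advance and reject requests that cannot meet their SLO, instead of
# letting them inflate tail latency. All names and numbers are hypothetical.

# Assumed profiled execution times per model, in milliseconds.
PROFILED_EXEC_MS = {
    "resnet50": 7.5,
    "bert-base": 28.0,
}

SLO_MS = 100.0  # end-to-end latency target, matching the abstract's 100 ms


class Worker:
    """Tracks when a serially executed GPU will next be free (relative ms)."""

    def __init__(self) -> None:
        self.busy_until_ms = 0.0

    def schedule(self, model: str, arrival_ms: float) -> bool:
        exec_ms = PROFILED_EXEC_MS[model]
        start_ms = max(arrival_ms, self.busy_until_ms)
        finish_ms = start_ms + exec_ms
        # Because exec_ms is assumed deterministic, finish_ms is known before
        # execution; a request that would miss its deadline is rejected now.
        if finish_ms - arrival_ms > SLO_MS:
            return False  # reject: SLO cannot be met
        self.busy_until_ms = finish_ms
        return True  # admit: predicted to finish within the SLO


if __name__ == "__main__":
    w = Worker()
    admitted = sum(w.schedule("bert-base", arrival_ms=0.0) for _ in range(5))
    print(f"admitted {admitted} of 5 requests within the {SLO_MS} ms SLO")
```

Running the example admits three of the five simultaneous requests (finishing at 28, 56, and 84 ms) and rejects the rest up front, which is the proactive, prediction-based behavior the abstract contrasts with reactive tail-latency mitigation.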
Pages: 443-462
Page count: 20