Practical Inference Server Systems

Transformer inference is accelerated along two dimensions:

Optimizations in FasterTransformer

Demonstration of the caching mechanism in the FasterTransformer library
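As a rough illustration of the caching idea, assuming the mechanism in the caption is the key/value cache that [1] describes for autoregressive decoding: keys and values of earlier tokens are stored once and reused at every later step instead of being recomputed. The sketch below is plain Python, not FasterTransformer code. The function names, the single attention head, and the use of the raw token vector as the query are all simplifying assumptions made for this example.

```python
import math


def project(x, matrix):
    """Compute x @ matrix, with the matrix given as a list of rows."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_without_cache(tokens, w_k, w_v):
    """Recompute the keys and values of every earlier token at each step."""
    outputs = []
    for t in range(len(tokens)):
        keys = [project(x, w_k) for x in tokens[: t + 1]]
        values = [project(x, w_v) for x in tokens[: t + 1]]
        outputs.append(attend(tokens[t], keys, values))
    return outputs


def decode_with_cache(tokens, w_k, w_v):
    """Project each token once and keep the results in a growing cache."""
    key_cache, value_cache, outputs = [], [], []
    for x in tokens:
        key_cache.append(project(x, w_k))
        value_cache.append(project(x, w_v))
        outputs.append(attend(x, key_cache, value_cache))
    return outputs
```

Both functions return the same outputs. The cached version performs one key projection and one value projection per token, while the uncached version repeats those projections for the whole prefix at every step.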

Triton inference server system[2]

A Docker-based ML serving system. It supports server-side adaptive batching, as shown in the figure below.

[Figure: batching diagram (../_images/batching-diagram.png)]
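Server-side batching can be sketched as follows. This is a simplified simulation, not the API of any serving system mentioned in these notes: the function name `form_batches` and the parameters `max_batch_size` and `max_wait` are hypothetical, and both limits are fixed here, whereas an adaptive batcher would tune them at run time from observed latency.

```python
def form_batches(arrival_times, max_batch_size, max_wait):
    """Group sorted request arrival times (in seconds) into batches.

    A batch is closed when it reaches max_batch_size, or when a new
    request arrives more than max_wait seconds after the first request
    of the batch that is currently open.
    """
    batches = []
    current = []
    for t in arrival_times:
        # The open batch has waited too long: dispatch it first.
        if current and t - current[0] > max_wait:
            batches.append(current)
            current = []
        current.append(t)
        # The open batch is full: dispatch it immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

For example, `form_batches([0.0, 0.001, 0.002, 0.050, 0.051], max_batch_size=2, max_wait=0.010)` groups the first two requests into a full batch, dispatches the third alone once its wait limit is exceeded, and batches the last two together. Larger limits trade per-request latency for throughput.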

A framework that can support multiple cloud services.

Fairly impressive. (On closer inspection, nothing special: it appears to support only TensorFlow, which limits its usefulness.)

Supports a large number of metrics, and lets you define the corresponding services.

Supports Kubernetes (k8s) and mixed devices.

KServe

Supports AWS.

A cloud and edge inference solution that supports dynamic batching and model ensembles. (An NVIDIA product, and quite impressive.) It includes optimizations for large models.

Quite interesting.

[1] NVIDIA Technical Blog. "Accelerated Inference for Large Transformer Models Using FasterTransformer and Triton Inference Server", August 3, 2022. https://developer.nvidia.com/blog/accelerated-inference-for-large-transformer-models-using-nvidia-fastertransformer-and-nvidia-triton-inference-server/.
[2] "Triton Inference Server | NVIDIA Developer". Accessed August 18, 2022. https://developer.nvidia.com/nvidia-triton-inference-server.
[3] Kijko, Pawel. "Best Tools to Do ML Model Serving". neptune.ai, February 15, 2021. https://neptune.ai/blog/ml-model-serving-best-tools.