We’ve recently contributed FP8 support to vLLM in collaboration with Neural Magic -- with this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation!

A common concern with FP8 is whether users will experience accuracy degradation. To address this, Neural Magic has produced checkpoints for key models with >99% accuracy preservation across a wide range of benchmarks (https://lnkd.in/gTimN5dZ), including:
- Llama3-70b
- Mixtral 8x7b
- Llama3-8b

You can easily try this out in vLLM, and read more about the feature here -- https://lnkd.in/gzKJqerB
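A minimal sketch of what "trying this out" looks like with vLLM's offline `LLM` API -- the model name below is one of Neural Magic's published FP8 checkpoints and is illustrative; running it requires a CUDA GPU with vLLM installed:

```python
# Hedged quick-start sketch: serving an FP8-quantized checkpoint with vLLM.
# Assumes `pip install vllm` and a CUDA-capable GPU; the checkpoint name is
# an example from Neural Magic's Hugging Face collection, not prescriptive.
from vllm import LLM, SamplingParams

# vLLM reads the quantization config from the checkpoint itself,
# so no extra quantization flags are needed for pre-quantized models.
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint works with the OpenAI-compatible server entrypoint as well, so existing clients don't need to change.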
Thanks for halving our compute requirements overnight! #ray is the way.
FP8 and Int8 activation quantization is one of the best ways to reduce the costs of LLM deployments, and #vLLM makes it easy to take advantage of these features. Stay tuned for more performance enhancements as we integrate quantized attention and other graph-level optimizations!