Anyscale’s Post

Anyscale

25,539 followers

We’ve recently contributed FP8 support to vLLM in collaboration with Neural Magic. With this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation!

A common concern with FP8 is whether users will experience accuracy degradation. To address this, Neural Magic has produced many checkpoints for key models with >99% accuracy preservation across a wide range of benchmarks (https://lnkd.in/gTimN5dZ), including:
- Llama3-70b
- Mixtral 8x7b
- Llama3-8b

You can easily try this out in vLLM (see the sketch below), and read more about the feature here: https://lnkd.in/gzKJqerB
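For anyone who wants to kick the tires, here is a minimal sketch of running an FP8 checkpoint in vLLM. Assumptions: a recent vLLM build with FP8 support, an FP8-capable GPU, and the checkpoint name below, which is illustrative -- substitute the Neural Magic FP8 model you actually want to run.

```python
# Minimal FP8 inference sketch for vLLM (assumes a recent vLLM build with
# FP8 support and a compatible GPU; the checkpoint name is illustrative).
from vllm import LLM, SamplingParams

# Load a pre-quantized FP8 checkpoint; the quantization method can also
# be requested explicitly via quantization="fp8".
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8", quantization="fp8")

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```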

Robert Shaw

Senior Director of Eng | Committer @vllm-project

2w

FP8 and INT8 activation quantization are among the best ways to reduce the cost of LLM deployments, and #vLLM makes it easy to take advantage of these features. Stay tuned for more performance enhancements as we integrate quantized attention and other graph-level optimizations!

Robert Caulk

Founder @ Emergent Methods | AskNews.app

2w

Thanks for halving our compute requirements overnight! #ray is the way.
