We’ve recently contributed FP8 support to vLLM in collaboration with Neural Magic -- with this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation!

A common concern with FP8 is whether users will experience accuracy degradation. To address this, Neural Magic has produced checkpoints for key models with >99% accuracy preservation across a wide range of benchmarks (https://lnkd.in/gTimN5dZ), including:
- Llama3-70b
- Mixtral 8x7b
- Llama3-8b

You can easily try this out in vLLM, and read more about the feature here -- https://lnkd.in/gzKJqerB
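A minimal sketch of what "trying this out" looks like with vLLM's offline `LLM` API -- the model name below is one of Neural Magic's published FP8 checkpoints and is illustrative; running it requires a CUDA GPU with vLLM installed:

```python
# Hedged quick-start sketch: serving an FP8-quantized checkpoint with vLLM.
# Assumes `pip install vllm` and a CUDA-capable GPU; the checkpoint name is
# an example from Neural Magic's Hugging Face collection, not prescriptive.
from vllm import LLM, SamplingParams

# vLLM reads the quantization config from the checkpoint itself,
# so no extra quantization flags are needed for pre-quantized models.
llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint works with the OpenAI-compatible server entrypoint as well, so existing clients don't need to change.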
Thanks for halving our compute requirements overnight! #ray is the way.
FP8 and Int8 activation quantization is one of the best ways to reduce the costs of LLM deployments, and #vLLM makes it easy to take advantage of these features. Stay tuned for more performance enhancements as we integrate quantized attention and other graph-level optimizations!