
Reconciling Accuracy, Cost, and Latency of Inference Serving Systems

17th Cloud Control Workshop, Sweden, 2024
https://cloudresearch.org/workshops/17th/

Pooyan Jamshidi

June 26, 2024

Transcript

  1. Reconciling Accuracy, Cost, and Latency of Inference Serving Systems
     Pooyan Jamshidi, https://pooyanjamshidi.github.io/, University of South Carolina
  2. Problem: Multi-Objective Optimization with Known Constraints under Uncertainty
     Solutions (with different assumptions):
     InfAdapter [2023]: Autoscaling for ML Inference
     IPA [2024]: Autoscaling for ML Inference Pipeline
     Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  3. InfAdapter [2023]: Autoscaling for ML Model Inference
     IPA [2024]: Autoscaling for ML Inference Pipeline
     Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  4. InfAdapter: How? Select a subset of model variants whose sizes meet the latency requirement for the predicted workload, while maximizing accuracy and minimizing resource cost (see the sketch below).
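
Framed as a small combinatorial search, the selection can be sketched as follows. This is a minimal illustration under assumed numbers (variant accuracies, latencies, per-replica cores, SLO, and trade-off weight), not InfAdapter's actual formulation or solver.

```python
from itertools import combinations

# Assumed per-variant profiles: accuracy (%), latency (ms), CPU cores per replica.
VARIANTS = {
    "resnet18":  (69.8, 20, 1),
    "resnet50":  (76.1, 45, 2),
    "resnet152": (78.3, 95, 4),
}
SLO_MS = 100       # latency objective (assumed)
COST_WEIGHT = 0.5  # how strongly resource cost counts against accuracy (assumed)

def utility(subset):
    """Average accuracy of the served variants minus weighted core cost."""
    avg_acc = sum(VARIANTS[v][0] for v in subset) / len(subset)
    cores = sum(VARIANTS[v][2] for v in subset)
    return avg_acc - COST_WEIGHT * cores

feasible = [
    s
    for r in range(1, len(VARIANTS) + 1)
    for s in combinations(VARIANTS, r)
    if all(VARIANTS[v][1] <= SLO_MS for v in s)  # every chosen variant meets the SLO
]
best = max(feasible, key=utility)
print("selected variants:", best)
```
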
  5. InfAdapter: Experimental evaluation setup
     Workload: Twitter-trace sample (2022-08)
     Baselines: Kubernetes VPA and Model-Switching
     Models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
     Adaptation interval: 30 seconds
     Kubernetes cluster: 48 cores, 192 GiB RAM
  6. Takeaway
     Model variants provide an opportunity to reduce resource cost while adapting to a dynamic workload.
     Using a set of model variants simultaneously yields higher average accuracy than serving a single variant.
     Inference serving systems should consider accuracy, latency, and cost at the same time.
  7. Takeaway
     Model variants provide an opportunity to reduce resource cost while adapting to a dynamic workload.
     Using a set of model variants simultaneously yields higher average accuracy than serving a single variant.
     Inference serving systems should consider accuracy, latency, and cost at the same time. InfAdapter!
  8. InfAdapter [2023]: Autoscaling for ML Model Inference
     IPA [2024]: Autoscaling for ML Inference Pipeline
     Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  9. Inference Pipeline: configuration options per stage
     Video Decoder: 55 | Stream Muxer: 86 | Primary Detector: 14 | Object Tracker: 44 | Secondary Classifier: 86
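
Multiplying the per-stage option counts gives a rough sense of how large the joint configuration space of such a pipeline is, assuming the stages' options compose independently (an illustrative calculation; the slide only reports per-stage counts).

```python
from math import prod

# Per-stage configuration option counts from the slide above.
options_per_stage = {
    "video_decoder": 55,
    "stream_muxer": 86,
    "primary_detector": 14,
    "object_tracker": 44,
    "secondary_classifier": 86,
}
# Treating the stages as independent gives the size of the joint space.
print(f"{prod(options_per_stage.values()):,} joint configurations")  # 250,576,480
```
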
  10. How to navigate the accuracy/latency trade-off space?
      Previous work (INFaaS, Model-Switching) has shown that models trained for the same task span a large latency/accuracy/resource-footprint trade-off space.
  11. Model Serving Pipeline: is scaling alone enough?
      ✗ Snapshot of the system   ✗ Adaptivity to multiple objectives
  12. InfAdapter [2023]: Autoscaling for ML Model Inference
      IPA [2024]: Autoscaling for ML Inference Pipeline
      Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  13. Dynamic Users -> Dynamic Network Bandwidths
      • Users move
      • Fluctuations in network bandwidth
      • Reduced time budget for processing requests
      SLO = network latency + processing latency
  14. Dynamic Users -> Dynamic Network Bandwidths
      • Users move
      • Fluctuations in network bandwidth
      • Reduced time budget for processing requests
      SLO = network latency + processing latency (see the sketch below)
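
One way to read the time-budget point: with a fixed end-to-end SLO, whatever the network consumes is no longer available for in-cluster processing. The snippet below illustrates that arithmetic with assumed numbers; it is not Sponge's code.

```python
# Assumed end-to-end SLO; as network latency grows, the in-cluster
# processing budget shrinks and the serving system must adapt.
SLO_MS = 150.0

def processing_budget_ms(network_latency_ms: float) -> float:
    """Time left for queuing + inference after subtracting network latency."""
    return max(SLO_MS - network_latency_ms, 0.0)

for net_ms in (20.0, 60.0, 120.0):  # users move, bandwidth fluctuates
    print(f"network={net_ms} ms -> processing budget={processing_budget_ms(net_ms)} ms")
```
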
  15. Inference Serving Requirements
      Highly responsive (end-to-end latency guarantee) and cost-efficient (least resource consumption).
      Resource scaling: in-place vertical scaling (more responsive) vs. horizontal scaling (more cost efficient). Sponge!
  16. Vertical Scaling: DL Model Profiling
      • How much resource should be allocated to a DL model?
      • Latency vs. batch size → linear relationship
      • Latency vs. CPU allocation → inverse relationship
      (see the profiling sketch below)
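
The two profiling observations suggest a simple parametric latency model, roughly latency ≈ a · batch / cpu + b. The sketch below fits such a model by least squares on made-up profiling samples; the functional form, data, and names are assumptions for illustration, not Sponge's profiler.

```python
import numpy as np

def fit_latency_model(batches, cpus, latencies_ms):
    """Fit latency ≈ a * (batch / cpu) + b by least squares on profiled samples."""
    x = np.asarray(batches, dtype=float) / np.asarray(cpus, dtype=float)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(latencies_ms, dtype=float), rcond=None)
    return a, b

def predict_latency_ms(a, b, batch, cpu):
    return a * batch / cpu + b

# Made-up profiling points: (batch size, CPU cores, observed latency in ms).
a, b = fit_latency_model([1, 4, 8, 8], [1, 1, 2, 4], [12, 40, 42, 24])
print(f"predicted latency at batch=16, cpu=4: {predict_latency_ms(a, b, 16, 4):.1f} ms")
```
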
  17. System Design: 3 design choices
      1. In-place vertical scaling • fast response time
      2. Request reordering • high-priority requests first
      3. Dynamic batching • increase system utilization
      (see the scheduling sketch below)
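
Design choices 2 and 3 can be illustrated together: order queued requests by deadline, then grow a batch only while the estimated batch latency still fits the tightest deadline in it. The sketch below is a simplified, assumed scheduler (names and numbers are illustrative), not Sponge's implementation.

```python
import heapq
import time

class Request:
    def __init__(self, req_id, deadline_s):
        self.req_id, self.deadline_s = req_id, deadline_s
    def __lt__(self, other):  # the heap orders requests earliest-deadline-first
        return self.deadline_s < other.deadline_s

def next_batch(queue, per_item_latency_s=0.010, max_batch=16):
    """Pop a deadline-ordered batch whose estimated latency fits the tightest deadline."""
    batch, now = [], time.monotonic()
    while queue and len(batch) < max_batch:
        tightest = batch[0].deadline_s if batch else queue[0].deadline_s
        est = (len(batch) + 1) * per_item_latency_s  # linear cost of a larger batch
        if now + est > tightest:  # growing the batch would miss a deadline
            break
        batch.append(heapq.heappop(queue))
    return batch

queue = []
for i, slack_s in enumerate([0.05, 0.03, 0.20]):
    heapq.heappush(queue, Request(i, time.monotonic() + slack_s))
print([r.req_id for r in next_batch(queue)])  # tightest deadlines are served first
```
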
  18. Evaluation: SLO guarantees (99th percentile) with up to 20% resource savings compared to static resource allocation.
      Sponge source code: https://github.com/saeid93/sponge
  19. Future Directions
      Resource scaling: in-place vertical scaling (more responsive) vs. horizontal scaling (more cost efficient). Sponge!
      How can both scaling mechanisms be used jointly under a dynamic workload to be responsive and cost-efficient while guaranteeing SLOs?
  20. Performance goals are competing, and users have preferences over these goals.
      The variability space (design space) of (composed) systems is growing exponentially.
      Systems operate in uncertain environments with imperfect and incomplete knowledge.
      Goal: enabling users to find the right quality trade-off.
      Testbeds: Lander Testbed (NASA), Turtlebot 3 (UofSC), Husky UGV (UofSC), CoBot (CMU)