
Reconciling Accuracy, Cost, and Latency of Inference Serving Systems

17th Cloud Control Workshop, Sweden, 2024
https://cloudresearch.org/workshops/17th/

Pooyan Jamshidi

June 26, 2024

Transcript

  1. Reconciling Accuracy, Cost, and Latency of Inference Serving Systems
     Pooyan Jamshidi, https://pooyanjamshidi.github.io/, University of South Carolina
  2. Problem: Multi-Objective Optimization with Known Constraints under Uncertainty
     Solutions (with different assumptions):
     InfAdapter [2023]: Autoscaling for ML Inference
     IPA [2024]: Autoscaling for ML Inference Pipeline
     Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  3. InfAdapter [2023]: Autoscaling for ML Model Inference
     IPA [2024]: Autoscaling for ML Inference Pipeline
     Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  4. InfAdapter: How? Select a subset of model variants whose sizes meet the latency requirement for the predicted workload, while maximizing accuracy and minimizing resource cost (see the sketch below).
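
Framed as a small combinatorial search, the selection can be sketched as follows. This is a minimal illustration under assumed numbers (variant accuracies, latencies, per-replica cores, SLO, and trade-off weight), not InfAdapter's actual formulation or solver.

```python
from itertools import combinations

# Assumed per-variant profiles: accuracy (%), latency (ms), CPU cores per replica.
VARIANTS = {
    "resnet18":  (69.8, 20, 1),
    "resnet50":  (76.1, 45, 2),
    "resnet152": (78.3, 95, 4),
}
SLO_MS = 100       # latency objective (assumed)
COST_WEIGHT = 0.5  # how strongly resource cost counts against accuracy (assumed)

def utility(subset):
    """Average accuracy of the served variants minus weighted core cost."""
    avg_acc = sum(VARIANTS[v][0] for v in subset) / len(subset)
    cores = sum(VARIANTS[v][2] for v in subset)
    return avg_acc - COST_WEIGHT * cores

feasible = [
    s
    for r in range(1, len(VARIANTS) + 1)
    for s in combinations(VARIANTS, r)
    if all(VARIANTS[v][1] <= SLO_MS for v in s)  # every chosen variant meets the SLO
]
best = max(feasible, key=utility)
print("selected variants:", best)
```
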
  5. InfAdapter: Experimental evaluation setup
     Workload: Twitter-trace sample (2022-08)
     Baselines: Kubernetes VPA and Model-Switching
     Models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
     Adaptation interval: 30 seconds
     Kubernetes cluster: 48 cores, 192 GiB RAM
  6. Takeaway
     Model variants provide an opportunity to reduce resource cost while adapting to a dynamic workload.
     Using a set of model variants simultaneously yields higher average accuracy than serving a single variant.
     Inference serving systems should consider accuracy, latency, and cost at the same time.
  7. Takeaway
     Model variants provide an opportunity to reduce resource cost while adapting to a dynamic workload.
     Using a set of model variants simultaneously yields higher average accuracy than serving a single variant.
     Inference serving systems should consider accuracy, latency, and cost at the same time. InfAdapter!
  8. InfAdapter [2023]: Autoscaling for ML Model Inference
     IPA [2024]: Autoscaling for ML Inference Pipeline
     Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  9. Inference Pipeline: configuration options per stage
     Video Decoder: 55 | Stream Muxer: 86 | Primary Detector: 14 | Object Tracker: 44 | Secondary Classifier: 86
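
Multiplying the per-stage option counts gives a rough sense of how large the joint configuration space of such a pipeline is, assuming the stages' options compose independently (an illustrative calculation; the slide only reports per-stage counts).

```python
from math import prod

# Per-stage configuration option counts from the slide above.
options_per_stage = {
    "video_decoder": 55,
    "stream_muxer": 86,
    "primary_detector": 14,
    "object_tracker": 44,
    "secondary_classifier": 86,
}
# Treating the stages as independent gives the size of the joint space.
print(f"{prod(options_per_stage.values()):,} joint configurations")  # 250,576,480
```
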
  10. How to navigate the accuracy/latency trade-off space?
      Previous work (INFaaS, Model-Switching) has shown that models trained for the same task span a large latency/accuracy/resource-footprint trade-off space.
  11. Model Serving Pipeline: is scaling alone enough?
      ✗ Snapshot of the system   ✗ Adaptivity to multiple objectives
  12. InfAdapter [2023]: Autoscaling for ML Model Inference
      IPA [2024]: Autoscaling for ML Inference Pipeline
      Sponge [2024]: Autoscaling for ML Inference Pipeline with Dynamic SLO
  13. Dynamic Users -> Dynamic Network Bandwidths
      • Users move
      • Fluctuations in network bandwidth
      • Reduced time budget for processing requests
      SLO = network latency + processing latency
  14. Dynamic Users -> Dynamic Network Bandwidths
      • Users move
      • Fluctuations in network bandwidth
      • Reduced time budget for processing requests
      SLO = network latency + processing latency (see the sketch below)
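
One way to read the time-budget point: with a fixed end-to-end SLO, whatever the network consumes is no longer available for in-cluster processing. The snippet below illustrates that arithmetic with assumed numbers; it is not Sponge's code.

```python
# Assumed end-to-end SLO; as network latency grows, the in-cluster
# processing budget shrinks and the serving system must adapt.
SLO_MS = 150.0

def processing_budget_ms(network_latency_ms: float) -> float:
    """Time left for queuing + inference after subtracting network latency."""
    return max(SLO_MS - network_latency_ms, 0.0)

for net_ms in (20.0, 60.0, 120.0):  # users move, bandwidth fluctuates
    print(f"network={net_ms} ms -> processing budget={processing_budget_ms(net_ms)} ms")
```
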
  15. Inference Serving Requirements
      Highly responsive (end-to-end latency guarantee) and cost-efficient (least resource consumption).
      Resource scaling: in-place vertical scaling (more responsive) vs. horizontal scaling (more cost efficient). Sponge!
  16. Vertical Scaling: DL Model Profiling
      • How much resource should be allocated to a DL model?
      • Latency vs. batch size → linear relationship
      • Latency vs. CPU allocation → inverse relationship
      (see the profiling sketch below)
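
The two profiling observations suggest a simple parametric latency model, roughly latency ≈ a · batch / cpu + b. The sketch below fits such a model by least squares on made-up profiling samples; the functional form, data, and names are assumptions for illustration, not Sponge's profiler.

```python
import numpy as np

def fit_latency_model(batches, cpus, latencies_ms):
    """Fit latency ≈ a * (batch / cpu) + b by least squares on profiled samples."""
    x = np.asarray(batches, dtype=float) / np.asarray(cpus, dtype=float)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(latencies_ms, dtype=float), rcond=None)
    return a, b

def predict_latency_ms(a, b, batch, cpu):
    return a * batch / cpu + b

# Made-up profiling points: (batch size, CPU cores, observed latency in ms).
a, b = fit_latency_model([1, 4, 8, 8], [1, 1, 2, 4], [12, 40, 42, 24])
print(f"predicted latency at batch=16, cpu=4: {predict_latency_ms(a, b, 16, 4):.1f} ms")
```
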
  17. System Design: 3 design choices
      1. In-place vertical scaling • fast response time
      2. Request reordering • high-priority requests first
      3. Dynamic batching • increase system utilization
      (see the scheduling sketch below)
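
Design choices 2 and 3 can be illustrated together: order queued requests by deadline, then grow a batch only while the estimated batch latency still fits the tightest deadline in it. The sketch below is a simplified, assumed scheduler (names and numbers are illustrative), not Sponge's implementation.

```python
import heapq
import time

class Request:
    def __init__(self, req_id, deadline_s):
        self.req_id, self.deadline_s = req_id, deadline_s
    def __lt__(self, other):  # the heap orders requests earliest-deadline-first
        return self.deadline_s < other.deadline_s

def next_batch(queue, per_item_latency_s=0.010, max_batch=16):
    """Pop a deadline-ordered batch whose estimated latency fits the tightest deadline."""
    batch, now = [], time.monotonic()
    while queue and len(batch) < max_batch:
        tightest = batch[0].deadline_s if batch else queue[0].deadline_s
        est = (len(batch) + 1) * per_item_latency_s  # linear cost of a larger batch
        if now + est > tightest:  # growing the batch would miss a deadline
            break
        batch.append(heapq.heappop(queue))
    return batch

queue = []
for i, slack_s in enumerate([0.05, 0.03, 0.20]):
    heapq.heappush(queue, Request(i, time.monotonic() + slack_s))
print([r.req_id for r in next_batch(queue)])  # tightest deadlines are served first
```
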
  18. Evaluation: SLO guarantees (99th percentile) with up to 20% resource savings compared to static resource allocation.
      Sponge source code: https://github.com/saeid93/sponge
  19. Future Directions
      Resource scaling: in-place vertical scaling (more responsive) vs. horizontal scaling (more cost efficient). Sponge!
      How can both scaling mechanisms be used jointly under a dynamic workload to be responsive and cost-efficient while guaranteeing SLOs?
  20. Performance goals are competing, and users have preferences over these goals.
      The variability space (design space) of (composed) systems is growing exponentially.
      Systems operate in uncertain environments with imperfect and incomplete knowledge.
      Goal: enabling users to find the right quality trade-off.
      Testbeds: Lander Testbed (NASA), Turtlebot 3 (UofSC), Husky UGV (UofSC), CoBot (CMU)