We recently moved to GKE Autopilot. We are using Kubernetes version 1.29.4 on the REGULAR release channel.
For most of the workloads, we run with the following nodeSelector.
For the first question, what are your Pod resource requests? I'm wondering if you're requesting so much that there's no pre-defined C3 machine type that can handle the size of the Pod.
For the second, Performance class is specifically a one-Pod-per-node compute class so that you can burst into the entire node at any time without worrying about competing with other Pods. The pod isolation label is probably supporting that, yes.
To spin up nodes in advance, you could deploy Pods with a low PriorityClass that don't do anything. They'd get evicted by your actual workload Pods if needed. https://cloud.google.com/kubernetes-engine/docs/how-to/capacity-provisioning has the instructions. BUT because of the Performance class pricing model you'll be paying for the idle node regardless of whether the small Pod is using it.
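To make the pattern concrete, here is a sketch of the placeholder-Pod setup those instructions describe: a low-priority PriorityClass plus a Deployment of pause Pods that reserve capacity and get evicted when real workloads need it. The names, replica count, and resource sizes below are illustrative, not taken from your setup.

```yaml
# Low-priority class so real workload Pods can preempt the placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-priority
value: -10
preemptionPolicy: Never
globalDefault: false
description: "Placeholder Pods that real workloads can evict."
---
# Deployment of pause Pods that do nothing but hold node capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: placeholder-priority
      nodeSelector:
        cloud.google.com/compute-class: Performance
        cloud.google.com/machine-family: e2
      containers:
        - name: pause
          image: registry.k8s.io/pause
          resources:
            requests:
              cpu: "500m"
              memory: 2Gi
```

Size the placeholder requests to roughly match your workload Pods so the pre-provisioned node is actually big enough for them.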
Thanks, now I understand the reasoning behind one Pod per node. Is there any way we can still continue using Performance nodes and deploy multiple Pods on them, so we don't see the lag of a node scaling up?
Just to make sure I am 100% clear on this: if I spin up a Pod that requests a Performance node from the e2 family with requests and limits of 50MB/50vCPU and 100MB/100vCPU, it would still spin up an e2-medium (this is what we have always seen), and the rest of the resources will just sit idle and hence be wasted. Correct?
Also I want to make sure that these resources are not shared with other GCP clients.
Not yet, but I believe that product teams are aware that it would be a good capability to have.
Yes, the unused resources would be idle if the machine size that was spun up was bigger than your Pod size and your Pod never burst into the extra capacity.
I'm...not sure what you mean by "shared with other GCP clients". Do you mean whether the underlying VM is dedicated to your project and workloads?
Yes, I meant to ask whether the underlying VM is dedicated to our project and workloads. I know it should be; I just wanted to be absolutely sure.
https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-security#security-boundaries will answer that question for you! From that page: all Autopilot VMs are exclusive to your project, so other GCP users' workloads don't run on the same VM.
Could you post your manifest as well, just to double check stuff?
apiVersion: v1
kind: Pod
metadata:
  name: cleopatra-kaniko-pod-self-hosted-d465651
  namespace: arc-runners
spec:
  nodeSelector:
    cloud.google.com/compute-class: Performance
    cloud.google.com/gke-spot: "true"
    cloud.google.com/machine-family: e2
  containers:
    - name: kaniko
      resources:
        limits:
          cpu: 1000m
          memory: 4Gi
        limits:
          cpu: 2000m
          memory: 6Gi
      image: gcr.io/kaniko-project/executor:latest
      args:
        - "XXXXXX"
      volumeMounts:
        - name: workspace
          mountPath: "/workspace"
        - name: modules-volume
          mountPath: /cache
      env:
        - name: GOCACHE
          value: "/cache"
  serviceAccount: kaniko-sa
  restartPolicy: Never
  volumes:
    - name: workspace
      emptyDir: *** ***
    - name: modules-volume
      persistentVolumeClaim:
        claimName: kaniko-modules-pvc
This is the sample manifest file. It runs GitHub workflows for building container images on self-hosted runners hosted on a GKE cluster.
Is it a typo that it says `resources.limits` twice? Does your actual manifest correctly specify `requests`?
Yeah, my bad, a bad copy-paste. Yes, there are requests and limits specified in the actual manifest.
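For reference, assuming the first duplicated `limits` block in the pasted manifest was meant to be `requests` (the values below are the ones from the pasted manifest), the corrected stanza would read:

```yaml
resources:
  requests:
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 2000m
    memory: 6Gi
```

On Autopilot, the requests are what drive node sizing, so it's worth confirming they reflect what the kaniko builds actually need.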