Why is GKE Autopilot node scaling unreliable?

We recently moved to GKE Autopilot. We are using Kubernetes version 1.29.4 on the REGULAR release channel.

For most of the workloads, we run with the following nodeSelector:

 

```
nodeSelector:
    cloud.google.com/compute-class: Performance
    cloud.google.com/gke-spot: 'true'
    cloud.google.com/machine-family: c3
```
 
 
Ideally, a node should spin up within minutes, but we often see that even after half an hour no new node has been added. When describing the Pod, I see this event:
```
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 24 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cloud.google.com/gke-quick-remove: true}
```
My question is: what am I missing here? Is there something we can do to reliably and quickly scale up nodes to run our workloads?
 
Also, when deploying a Pod I see that GKE Warden adds a bunch of extra fields to the Pod, such as `cloud.google.com/pod-isolation: '2'`. What does this field mean, and why is it added?
 
Another issue (or feature?) I see is that every Pod we deploy is spun up on a new node, which makes Pod creation slower because a node has to be scaled up first. Is this because of the `cloud.google.com/pod-isolation: '2'` annotation?
 

For the first question, what are your Pod resource requests? I'm wondering if you're requesting so much that there's no pre-defined C3 machine type that can handle the size of the Pod. 

For the second, Performance class is specifically a one-Pod-per-node compute class so that you can burst into the entire node at any time without worrying about competing with other Pods. The pod isolation label is probably supporting that, yes. 

To spin up nodes in advance, you could deploy Pods with a low PriorityClass that don't do anything. They'd get evicted by your actual workload Pods if needed. https://cloud.google.com/kubernetes-engine/docs/how-to/capacity-provisioning has the instructions. BUT because of the Performance class pricing model you'll be paying for the idle node regardless of whether the small Pod is using it. 
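Roughly what that could look like, reusing the selectors and Pod size from this thread (the PriorityClass and Deployment names below are just placeholders; the capacity-provisioning doc above has the exact recommended setup):

```
# Low-priority class: placeholder Pods get evicted as soon as a real workload needs the node.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-priority        # placeholder name
value: -10
preemptionPolicy: Never
globalDefault: false
description: "Placeholder Pods that reserve spare capacity for real workloads."
---
# Keeps one spare Performance/c3 Spot node warm by running a pause container on it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder        # placeholder name
  namespace: arc-runners
spec:
  replicas: 1
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: placeholder-priority
      terminationGracePeriodSeconds: 0
      nodeSelector:
        cloud.google.com/compute-class: Performance
        cloud.google.com/gke-spot: "true"
        cloud.google.com/machine-family: c3
      containers:
      - name: pause
        image: registry.k8s.io/pause      # does nothing; just holds the reservation
        resources:
          requests:
            cpu: 2000m      # size this to match (or exceed) your workload Pods
            memory: 6Gi
```

When a real Pod with higher priority can't fit anywhere else, the placeholder is evicted and the real Pod takes its node, so you're trading idle-node cost for startup latency.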

Thanks, now I understand the reasoning behind one Pod per node. Is there any way we can still continue using Performance nodes and deploy multiple Pods on them, so we don't see the lag of nodes scaling up?

Just to make sure I am 100% clear on this: if I spin up a Pod which requests a Performance node of e2 with requests and limits of 50MB/50vCPU and 100MB/100vCPU, it would still spin up an e2-medium (this is what we have always seen), and the rest of the resources will just sit idle and hence be wasted. Correct?

 

Also, I want to make sure that these resources are not shared with other GCP clients.

Not yet, but I believe that product teams are aware that it would be a good capability to have. 

Yes, the unused resources would be idle if the machine size that was spun up was bigger than your Pod size and your Pod never burst into the extra capacity. 

I'm...not sure what you mean by "shared with other GCP clients". Do you mean whether the underlying VM is dedicated to your project and workloads? 

Yes, I meant to ask whether the underlying VM is dedicated to our project and workloads. I know it should be, I just wanted to be absolutely sure.

https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-security#security-boundaries will answer that question for you! From that page: all autopilot VMs are exclusive to your project, so other GCP user workloads don't run on the same VM. 

```
cpu: 2000m
memory: 6Gi
```
 
This is what I am requesting for the c3 nodes, but it won't scale up 7 out of 10 times.
 
I have also seen an error along the lines of "GCP out of resources". Is this because no Spot instances of c3 are available in, say, us-central1-a?
 

Could you post your manifest as well, just to double check stuff? 

```
apiVersion: v1
kind: Pod
metadata:
  name: cleopatra-kaniko-pod-self-hosted-d465651
  namespace: arc-runners
spec:
  nodeSelector:
    cloud.google.com/compute-class: Performance
    cloud.google.com/gke-spot: "true"
    cloud.google.com/machine-family: e2
  containers:
  - name: kaniko
    resources:
      limits:
        cpu: 1000m
        memory: 4Gi
      limits:
        cpu: 2000m
        memory: 6Gi
    image: gcr.io/kaniko-project/executor:latest
    args:
      - "XXXXXX"
    volumeMounts:
    - name: workspace
      mountPath: "/workspace"  
    - name: modules-volume
      mountPath: /cache
    env:
      - name: GOCACHE
        value: "/cache"
  serviceAccount: kaniko-sa
  restartPolicy: Never
  volumes:
  - name: workspace
    emptyDir: *** ***
  - name: modules-volume
    persistentVolumeClaim:
      claimName: kaniko-modules-pvc
```

This is the sample manifest. It basically runs GitHub workflows for building container images on self-hosted runners hosted on a GKE cluster.

Is it a typo that it says `resources.limits` twice? Does your actual manifest correctly specify `requests`?

Yeah, my bad, bad copy-paste. Yes, there are limits and requests specified.
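The intended block is along these lines (which pair goes under requests versus limits is my assumption, going by the values in the manifest above):

```
resources:
  requests:          # assuming the 1000m / 4Gi pair were the requests
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 2000m
    memory: 6Gi
```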
