Hello, we've been experiencing some strange behavior with our GKE Autopilot clusters lately. After a deployment, several Pods end up in the "OutOfcpu" state. The desired number of Pods does eventually start, but the failed Pods are left behind with this error. For example, if we need 20 Pods, some (say 10) may fail initially with "OutOfcpu"; the 20 Pods eventually come up, but we end up with 30 Pods total, 10 of which carry the error. In more extreme cases, we have seen 150+ Pods with this error.
What could be causing this issue?
Thank you,
Can you post the output of `kubectl describe pod` for one of the OutOfcpu Pods? Also, can you share the Pod/Deployment spec as well?
Unfortunately, we don't have any Pods showing this error right now, so I can't run a describe against one. However, here is an example of the Deployment spec (some attributes have been removed for company confidentiality).
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: worker
  name: worker
spec:
  selector:
    matchLabels:
      app: worker
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: worker
    spec:
      shareProcessNamespace: true
      serviceAccountName: ksa-worker
      containers:
      - image: gcr.io/cloudsql-docker/gce-proxy:latest
        name: cloud-sql-proxy
        command:
        - "/cloud_sql_proxy"
        - "-instances=$(DB_INSTANCE_CONNECTION_NAME)=tcp:5432"
        securityContext:
          runAsNonRoot: true
        resources:
          limits:
            cpu: 0.25
            memory: 256Mi
            ephemeral-storage: 1Gi
          requests:
            cpu: 0.25
            memory: 256Mi
            ephemeral-storage: 1Gi
      - image: us-east1-docker.pkg.dev/image_path
        name: worker
        command: ["/bin/sh", "-c"]
        args:
        - |
          sleep 2s
          exec bundle exec sidekiq
          sidekiq_exit_code=$?
          sql_proxy_pid=$(pgrep cloud_sql_proxy) && kill -INT $sql_proxy_pid && exit $sidekiq_exit_code
        securityContext:
          runAsUser: 0
          capabilities:
            add:
            - SYS_PTRACE
        resources:
          limits:
            cpu: 1
            memory: 3Gi
            ephemeral-storage: 10Gi
          requests:
            cpu: 1
            memory: 3Gi
            ephemeral-storage: 10Gi
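One thing worth noting about the rollout numbers: with `maxSurge: 50%` and 20 desired replicas, a rolling update is allowed to create up to 10 extra Pods on top of the desired count, which may be related to the "30 Pods total" observation above. A rough sketch of that arithmetic (Kubernetes rounds a percentage-based maxSurge up to the nearest Pod; the function name is just for illustration):

```python
import math

def max_surge_pods(replicas: int, max_surge_percent: int) -> int:
    """Extra Pods a rolling update may create beyond the desired count.

    Kubernetes rounds a percentage-based maxSurge up to a whole Pod.
    """
    return math.ceil(replicas * max_surge_percent / 100)

# 20 desired replicas with maxSurge: 50% -> up to 10 surge Pods,
# i.e. as many as 30 Pods existing at once during the rollout
print(max_surge_pods(20, 50))
```

With `maxUnavailable: 0` as in this spec, every replacement Pod must be a surge Pod, so the rollout always schedules over and above the steady-state request footprint.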
Thanks,
We encountered the same issue again, so here is the description of a Pod that failed with the "OutOfcpu" error. My initial post shared the configuration of a worker we use; this latest error occurred on a web server instead, so I'm providing that Pod's description, though I don't expect the spec difference to be particularly relevant.
Name:             webserserver-6df5b46fcb-2kfxk
Namespace:        default
Priority:         0
Service Account:  ksa-webserver-cloud-sql-prd
Node:             gk3-cluster-platform-prd-nap-m5fvnmo2-b048b406-zt2q/
Start Time:       Tue, 04 Jun 2024 13:07:03 -0400
Labels:           app=webserserver
                  app.kubernetes.io/managed-by=google-cloud-deploy
                  deploy.cloud.google.com/delivery-pipeline-id=webserver-prd-pipeline
                  deploy.cloud.google.com/location=us-east1
                  deploy.cloud.google.com/project-id=webserver-a2b65e6a0q
                  deploy.cloud.google.com/release-id=webserver-release-7558b5bb5
                  deploy.cloud.google.com/target-id=prd
                  pod-template-hash=6df5b46fcb
                  skaffold.dev/run-id=2pm4c4hoa5ofm3a0b265a7ffkm2oiifpe6bbdob1j0osdaexyk6h
Annotations:      secrets.doppler.com/secretsupdate.webserver-secret: W/"0f1460cae340e731ea7db6508f5c8e6659156e9c7e62a71c3300191fb675543a"
Status:           Failed
Reason:           OutOfcpu
Message:          Pod was rejected: Node didn't have enough resource: cpu, requested: 2500, used: 5431, capacity: 7910
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    ReplicaSet/webserserver-6df5b46fcb
Containers:
  cloud-sql-proxy:
    Image:      gcr.io/cloudsql-docker/gce-proxy:1.29.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /cloud_sql_proxy
      -instances=$(DB_INSTANCE_CONNECTION_NAME)=tcp:5432
      -term_timeout=40s
    Limits:
      cpu:                500m
      ephemeral-storage:  966367641
      memory:             1Gi
    Requests:
      cpu:                500m
      ephemeral-storage:  966367641
      memory:             1Gi
    Environment Variables from:
      webserver-secret  Secret  Optional: false
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dcl4z (ro)
  webserserver:
    Image:      us-east1-docker.pkg.dev/webserver-a2b65e6a0q/webserver/webserver:7558b5bb581c659b10d387cd0bca53e94750346f
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
    Args:
      sleep 2s
      exec bundle exec rails s
      rails_exit_code=$?
      sql_proxy_pid=$(pgrep cloud_sql_proxy) && kill -INT $sql_proxy_pid && exit $rails_exit_code
    Limits:
      cpu:                2
      ephemeral-storage:  9760313180
      memory:             2Gi
    Requests:
      cpu:                2
      ephemeral-storage:  9760313180
      memory:             2Gi
    Liveness:   http-get http://:3000/health_check delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:  http-get http://:3000/health_check delay=0s timeout=10s period=10s #success=1 #failure=3
    Environment Variables from:
      webserver-secret  Secret  Optional: false
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dcl4z (ro)
Readiness Gates:
  Type                                      Status
  cloud.google.com/load-balancer-neg-ready
Conditions:
  Type                                      Status
  cloud.google.com/load-balancer-neg-ready
Volumes:
  kube-api-access-dcl4z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     kubernetes.io/arch=amd64:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                   Age    From                                   Message
  ----     ------                   ----   ----                                   -------
  Normal   Scheduled                5m18s  gke.io/optimize-utilization-scheduler  Successfully assigned default/webserserver-6df5b46fcb-2kfxk to gk3-cluster-platform-prd-nap-m5fvnmo2-b048b406-zt2q
  Warning  OutOfcpu                 5m18s  kubelet                                Node didn't have enough resource: cpu, requested: 2500, used: 5431, capacity: 7910
  Normal   LoadBalancerNegNotReady  5m4s   neg-readiness-reflector                Waiting for pod to become healthy in at least one of the NEG(s): [...]
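The numbers in the OutOfcpu event (values are in millicores) explain the rejection: the Pod requests 2500m of CPU, but the node only has 7910m - 5431m = 2479m left, so the kubelet rejects the Pod at admission even though the scheduler had already assigned it there. A minimal sketch of that check, assuming a plain requested-vs-remaining comparison (the function name is illustrative, not a real kubelet API):

```python
def node_can_admit(requested_mcpu: int, used_mcpu: int, capacity_mcpu: int) -> bool:
    # kubelet-style admission check (all values in millicores):
    # the Pod fits only if its request does not exceed what is left on the node
    return requested_mcpu <= capacity_mcpu - used_mcpu

# numbers from the event above: requested 2500m, used 5431m, capacity 7910m;
# only 2479m remain, so the Pod is rejected
print(node_can_admit(2500, 5431, 7910))
```

This also fits the pattern in the original post: the scheduler places a Pod on a node whose free CPU has changed by the time the kubelet admits it, the Pod fails with OutOfcpu, and the controller creates a replacement while the failed Pod object lingers until it is garbage-collected or deleted.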
Hi @garisingh, any ideas about this?
Thanks