Hello, we've been experiencing some strange behavior with our GKE Autopilot clusters lately. After a deployment, several Pods end up in the "OutOfcpu" state. The desired number of Pods does eventually start, but the failed Pods are left behind with this error. For example, if we need 20 Pods, some (say 10) may fail initially with "OutOfcpu"; the 20 Pods eventually come up, but we end up with 30 Pods total, 10 of which carry the error. In more extreme cases, we have seen 150+ Pods with this error.
What could be causing this issue?
Thank you,
Can you post the output of `kubectl describe pod` for one of the OutOfcpu Pods? Also, can you share the Pod/Deployment spec as well?
Unfortunately, we don't have any Pods showing this error right now, so I can't run a describe against one. However, here is an example of the Deployment spec (some attributes have been removed for company confidentiality).
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: worker
  name: worker
spec:
  selector:
    matchLabels:
      app: worker
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: worker
    spec:
      shareProcessNamespace: true
      serviceAccountName: ksa-worker
      containers:
      - image: gcr.io/cloudsql-docker/gce-proxy:latest
        name: cloud-sql-proxy
        command:
        - "/cloud_sql_proxy"
        - "-instances=$(DB_INSTANCE_CONNECTION_NAME)=tcp:5432"
        securityContext:
          runAsNonRoot: true
        resources:
          limits:
            cpu: 0.25
            memory: 256Mi
            ephemeral-storage: 1Gi
          requests:
            cpu: 0.25
            memory: 256Mi
            ephemeral-storage: 1Gi
      - image: us-east1-docker.pkg.dev/image_path
        name: worker
        command: ["/bin/sh", "-c"]
        args:
        - |
          sleep 2s
          exec bundle exec sidekiq
          sidekiq_exit_code=$?
          sql_proxy_pid=$(pgrep cloud_sql_proxy) && kill -INT $sql_proxy_pid && exit $sidekiq_exit_code
        securityContext:
          runAsUser: 0
          capabilities:
            add:
            - SYS_PTRACE
        resources:
          limits:
            cpu: 1
            memory: 3Gi
            ephemeral-storage: 10Gi
          requests:
            cpu: 1
            memory: 3Gi
            ephemeral-storage: 10Gi
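One thing worth noting about the rollout numbers: with `maxSurge: 50%` and 20 desired replicas, a rolling update is allowed to create up to 10 extra Pods on top of the desired count, which may be related to the "30 Pods total" observation above. A rough sketch of that arithmetic (Kubernetes rounds a percentage-based maxSurge up to the nearest Pod; the function name is just for illustration):

```python
import math

def max_surge_pods(replicas: int, max_surge_percent: int) -> int:
    """Extra Pods a rolling update may create beyond the desired count.

    Kubernetes rounds a percentage-based maxSurge up to a whole Pod.
    """
    return math.ceil(replicas * max_surge_percent / 100)

# 20 desired replicas with maxSurge: 50% -> up to 10 surge Pods,
# i.e. as many as 30 Pods existing at once during the rollout
print(max_surge_pods(20, 50))
```

With `maxUnavailable: 0` as in this spec, every replacement Pod must be a surge Pod, so the rollout always schedules over and above the steady-state request footprint.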
Thanks,
We encountered the same issue again, so here is the description of a Pod that failed with the "OutOfcpu" error. My initial post shared the configuration of a worker we use; this latest error occurred on a web server instead, so I'm providing that Pod's description, though I don't expect the spec difference to be particularly relevant.
Name:             webserserver-6df5b46fcb-2kfxk
Namespace:        default
Priority:         0
Service Account:  ksa-webserver-cloud-sql-prd
Node:             gk3-cluster-platform-prd-nap-m5fvnmo2-b048b406-zt2q/
Start Time:       Tue, 04 Jun 2024 13:07:03 -0400
Labels:           app=webserserver
                  app.kubernetes.io/managed-by=google-cloud-deploy
                  deploy.cloud.google.com/delivery-pipeline-id=webserver-prd-pipeline
                  deploy.cloud.google.com/location=us-east1
                  deploy.cloud.google.com/project-id=webserver-a2b65e6a0q
                  deploy.cloud.google.com/release-id=webserver-release-7558b5bb5
                  deploy.cloud.google.com/target-id=prd
                  pod-template-hash=6df5b46fcb
                  skaffold.dev/run-id=2pm4c4hoa5ofm3a0b265a7ffkm2oiifpe6bbdob1j0osdaexyk6h
Annotations:      secrets.doppler.com/secretsupdate.webserver-secret: W/"0f1460cae340e731ea7db6508f5c8e6659156e9c7e62a71c3300191fb675543a"
Status:           Failed
Reason:           OutOfcpu
Message:          Pod was rejected: Node didn't have enough resource: cpu, requested: 2500, used: 5431, capacity: 7910
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    ReplicaSet/webserserver-6df5b46fcb
Containers:
  cloud-sql-proxy:
    Image:      gcr.io/cloudsql-docker/gce-proxy:1.29.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /cloud_sql_proxy
      -instances=$(DB_INSTANCE_CONNECTION_NAME)=tcp:5432
      -term_timeout=40s
    Limits:
      cpu:                500m
      ephemeral-storage:  966367641
      memory:             1Gi
    Requests:
      cpu:                500m
      ephemeral-storage:  966367641
      memory:             1Gi
    Environment Variables from:
      webserver-secret  Secret  Optional: false
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dcl4z (ro)
  webserserver:
    Image:      us-east1-docker.pkg.dev/webserver-a2b65e6a0q/webserver/webserver:7558b5bb581c659b10d387cd0bca53e94750346f
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
    Args:
      sleep 2s
      exec bundle exec rails s
      rails_exit_code=$?
      sql_proxy_pid=$(pgrep cloud_sql_proxy) && kill -INT $sql_proxy_pid && exit $rails_exit_code
    Limits:
      cpu:                2
      ephemeral-storage:  9760313180
      memory:             2Gi
    Requests:
      cpu:                2
      ephemeral-storage:  9760313180
      memory:             2Gi
    Liveness:   http-get http://:3000/health_check delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:  http-get http://:3000/health_check delay=0s timeout=10s period=10s #success=1 #failure=3
    Environment Variables from:
      webserver-secret  Secret  Optional: false
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dcl4z (ro)
Readiness Gates:
  Type                                      Status
  cloud.google.com/load-balancer-neg-ready
Conditions:
  Type                                      Status
  cloud.google.com/load-balancer-neg-ready
Volumes:
  kube-api-access-dcl4z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     kubernetes.io/arch=amd64:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                   Age    From                                   Message
  ----     ------                   ----   ----                                   -------
  Normal   Scheduled                5m18s  gke.io/optimize-utilization-scheduler  Successfully assigned default/webserserver-6df5b46fcb-2kfxk to gk3-cluster-platform-prd-nap-m5fvnmo2-b048b406-zt2q
  Warning  OutOfcpu                 5m18s  kubelet                                Node didn't have enough resource: cpu, requested: 2500, used: 5431, capacity: 7910
  Normal   LoadBalancerNegNotReady  5m4s   neg-readiness-reflector                Waiting for pod to become healthy in at least one of the NEG(s): [...]
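The numbers in the OutOfcpu event (values are in millicores) explain the rejection: the Pod requests 2500m of CPU, but the node only has 7910m - 5431m = 2479m left, so the kubelet rejects the Pod at admission even though the scheduler had already assigned it there. A minimal sketch of that check, assuming a plain requested-vs-remaining comparison (the function name is illustrative, not a real kubelet API):

```python
def node_can_admit(requested_mcpu: int, used_mcpu: int, capacity_mcpu: int) -> bool:
    # kubelet-style admission check (all values in millicores):
    # the Pod fits only if its request does not exceed what is left on the node
    return requested_mcpu <= capacity_mcpu - used_mcpu

# numbers from the event above: requested 2500m, used 5431m, capacity 7910m;
# only 2479m remain, so the Pod is rejected
print(node_can_admit(2500, 5431, 7910))
```

This also fits the pattern in the original post: the scheduler places a Pod on a node whose free CPU has changed by the time the kubelet admits it, the Pod fails with OutOfcpu, and the controller creates a replacement while the failed Pod object lingers until it is garbage-collected or deleted.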
Hi @garisingh, any ideas about this?
Thanks