Evaluating Kubernetes Behavior during Resource Exhaustion

Published on June 08, 2021

Table of Contents

Introduction
Node Conditions
Test Setup
Cluster Setup
Creating Node Failures
Conclusions

Introduction

Kubernetes is often pitched as a platform that is able to effectively run fault-tolerant distributed workloads.

While that’s true, it’s up to the developer to provide enough hints to the scheduler to allow it to do its job. Configuring these things can be confusing and unintuitive.

In this post, I will explore the Kubernetes concepts of Pod Disruption Budgets, Priority Classes, Resource Quotas, and Quality of Service classes to provide suggestions for what a production-ready configuration should look like for a Kubernetes application that requires high availability.

Pod Disruption Budgets are Kubernetes objects that tell the Kubernetes API server the minimum number of pods that must stay available, or the maximum number that may be unavailable, during a disruption on the node. Pod Priority Classes indicate the importance of pods relative to other pods for eviction and scheduling. Quality of Service classes indicate how Kubernetes should plan for allocating and reclaiming resources on nodes for effective bin packing and eviction under disruption. These three resources appear closely intertwined, so it is helpful to understand exactly what is being indicated when these resources are configured on the cluster. We want to evaluate how to appropriately segregate and configure your Kubernetes objects to handle system and node-level outages.

Disclaimer: We’re going to focus on the challenges of handling node-level issues from an application perspective. A single application experiencing significant over-consumption of resources can create a node-level impact, but this post is not going to focus on that level of application configuration. The specific goal here is to provide suggestions for Kubernetes resources that will help an application stay stable in the face of node-level issues or outages.

Node Conditions

The Kubernetes node API resource provides a set of conditions that inform operators of the state of the node itself. In a normal operating state, these conditions will all provide a positive status message.

$ kubectl describe node worker-0
FrequentKubeletRestart        False   NoFrequentKubeletRestart        kubelet is functioning properly
FrequentDockerRestart         False   NoFrequentDockerRestart         docker is functioning properly
FrequentContainerdRestart     False   NoFrequentContainerdRestart     containerd is functioning properly
KernelDeadlock                False   KernelHasNoDeadlock             kernel has no deadlock
ReadonlyFilesystem            False   FilesystemIsNotReadOnly         Filesystem is not read-only
CorruptDockerOverlay2         False   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
FrequentUnregisterNetDevice   False   NoFrequentUnregisterNetDevice   node is functioning properly
NetworkUnavailable            False   RouteCreated                    RouteController created a route
MemoryPressure                False   KubeletHasSufficientMemory      kubelet has sufficient memory available
DiskPressure                  False   KubeletHasNoDiskPressure        kubelet has no disk pressure
PIDPressure                   False   KubeletHasSufficientPID         kubelet has sufficient PID available
Ready                         True    KubeletReady                    kubelet is posting ready status. AppArmor enabled

Note: This set of conditions can vary based on the Kubernetes implementation.

Each of these conditions can set a specific taint on the node. Taints are set on a node much like labels, but they operate in conjunction with the eviction manager. The taint associated with a node condition kicks off an eviction process on the node. Each eviction condition comes with a default timeout, after which pods are evicted. For node unavailability, the eviction timeout is 5 minutes unless it is overridden by a toleration on the pod itself.
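As a rough illustration of the pod-level override, a toleration like the one below could shorten the wait before eviction from an unreachable node. The 60-second value is an assumption picked for the example and was not a setting used in this experiment.

```yaml
# Sketch: pod template fragment. The 60s value is illustrative only.
spec:
  tolerations:
  # Stay bound to an unreachable node for 60s instead of the default 300s.
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  # The not-ready taint has its own timer, so it is usually set alongside.
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
```

A shorter timer means faster rescheduling, at the cost of more pod churn when a node is only briefly unreachable.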

When these conditions occur, the eviction manager runs a series of calculations that attempt to determine which pod evictions would be most likely to resolve the current issue on the node. These calculations take inputs from pod priority, quality of service class, resource requests and limits, and current resource consumption. In this experiment, I will tune these settings to validate how Kubernetes executes this calculation and use that result to provide some recommendations for running fault-resistant applications on the platform.

Test Setup

In order to set up this experiment, we need to configure a set of deployments using these API resources. First, I created a deployment using the highest possible custom priority class available for users to consume:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
 name: high-priority
value: 1000000000
globalDefault: false
description: "Highest user priority"
preemptionPolicy: Never
---
apiVersion: apps/v1
kind: Deployment
metadata:
 name: pod-priority-high-deployment
spec:
 replicas: 2
 template:
   spec:
     priorityClassName: high-priority
     containers:
     - name: nginx
       image: nginx:1.14.2
       ports:
       - containerPort: 80
       resources:
         limits:
           memory: 100Mi
           cpu: 100m
         requests:
           memory: 25Mi
           cpu: 25m

Note: I will skip writing out API fields that are not relevant to the specific task at hand.

This deployment will have the highest possible pod priority of all non-system pods and the QoS class of Burstable.

Next, I set up a deployment that has the highest QoS class, Guaranteed. This Quality of Service class creates a specific reservation on the node, which gives the pod constant access to a static set of CPU and memory. The Guaranteed QoS class informs the kubelet that as long as a pod is not exceeding its requests, it is behaving appropriately and should not be removed from the node.

apiVersion: apps/v1
kind: Deployment
metadata:
 name: guaranteed-deploy
spec:
 replicas: 2
 template:
   spec:
     containers:
     - name: nginx
       image: nginx:1.14.2
       ports:
       - containerPort: 80
       resources:
         limits:
           memory: 100Mi
           cpu: 100m
         requests:
           memory: 100Mi
           cpu: 100m

Though the pod itself does not specify the QoS class, we can see it in the resultant pod status field.

status:
 hostIP: 10.240.0.21
 phase: Running
 podIP: 10.36.0.1
 qosClass: Guaranteed

The QoS class cannot be specified directly; the API server calculates the field from the configured resource requests and limits. This is done because the QoS depends on how Kubernetes can allocate host resources to the pod. Since this pod requests a static set of resources, that is, an amount of CPU and memory that does not need to scale beyond what it requests from the host, the kubelet sets aside this resource request specifically for the deployment.
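For reference, the three resource shapes that lead to each QoS class can be sketched side by side. The values are illustrative; the rule itself (requests equal to limits on every container for Guaranteed, nothing set at all for BestEffort, everything else Burstable) is the documented behavior.

```yaml
# Guaranteed: every container sets cpu and memory, and requests == limits.
resources:
  requests: {cpu: 100m, memory: 100Mi}
  limits:   {cpu: 100m, memory: 100Mi}
---
# Burstable: at least one request or limit is set, but the pod
# does not meet the Guaranteed criteria.
resources:
  requests: {cpu: 25m, memory: 25Mi}
  limits:   {cpu: 100m, memory: 100Mi}
---
# BestEffort: no container in the pod sets any request or limit.
resources: {}
```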

Finally, I set up a pod disruption budget and a deployment whose labels match the budget’s selector:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
 name: workload-pdb
spec:
 minAvailable: 2
 selector:
   matchLabels:
     app: pdb-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
 name: pdb-deployment
 labels:
   app: pdb-app
spec:
 replicas: 2
 selector:
   matchLabels:
     app: pdb-app
 template:
   metadata:
     labels:
       app: pdb-app
   spec:
     containers:
     - name: pdb
       image: nginx:1.14.2
       ports:
       - containerPort: 80
       resources:
         limits:
           memory: 110Mi
           cpu: 250m
         requests:
           memory: 25Mi
           cpu: 70m
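Note that with minAvailable: 2 and only two replicas, this budget leaves no room for any voluntary eviction, so an operation such as a node drain would be blocked for these pods. A budget that still allows one pod to move at a time might look like the sketch below; the object name is hypothetical and this variant was not part of the test.

```yaml
# Sketch: an alternative budget that permits one voluntary eviction at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: workload-pdb-relaxed   # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: pdb-app
```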

With these deployments configured, we are now ready to set up a test environment and deploy these workloads.

Name                          QoS Class   Priority Value  Disruption Budget  Replicas
guaranteed-deploy             Guaranteed  Default: 0      None               2
pdb-deployment                Burstable   Default: 0      minAvailable: 2    2
pod-priority-high-deployment  Burstable   1000000000      None               2

Cluster Setup

I used two different environments for this test. First, I set up a default GKE cluster with two node pools to isolate the GKE default system resources:

$ kubectl get nodes
NAME                                               STATUS  ROLES   AGE  VERSION
gke-resource-testing-default-pool-7ca65680-fjjs    Ready   <none>  58m  v1.19.9-gke.1900
gke-resource-testing-default-pool-7ca65680-rsd3    Ready   <none>  58m  v1.19.9-gke.1900
gke-resource-testing-scheduling-pool-08d0efc8-0cw5 Ready   <none>  47m  v1.19.9-gke.1900
gke-resource-testing-scheduling-pool-08d0efc8-563z Ready   <none>  47m  v1.19.9-gke.1900

In this cluster, our workloads will target the “scheduling” node pool. In order to focus our test on those nodes, we must pin the workloads to them. This is done with node selectors and anti-affinity rules in the Deployment podTemplate specification:

   spec:
     nodeSelector:
       cloud.google.com/gke-nodepool: scheduling-pool
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: app
               operator: In
               values:
               - pod-priority-preempt
           topologyKey: "kubernetes.io/hostname"

These two settings force pods onto the two separate nodes of the scheduling pool. While nodeSelectors may be essential for placing your workloads on the appropriate nodes in the cluster, these settings are not parameters for the test and remained fixed throughout.

I also set up a cluster on GCE nodes using kubeadm. This cluster also had two target nodes for running this test, and given the small set of resources configured in kube-system by the kubeadm tool, I moved those resources to the system node:

$ kubectl get nodes
NAME           STATUS   ROLES                  AGE     VERSION
controller-0   Ready    control-plane,master   18m     v1.21.1
worker-0       Ready    <none>                 18m     v1.21.1
worker-1       Ready    <none>                 18m     v1.21.1

Creating Node Failures

Now it’s time for the fun part, creating problems!

I wanted to specifically simulate three node failure conditions: memory pressure, disk pressure, and an unreachable node. In theory, these three conditions should have slightly different behaviors based on their eviction thresholds, configuration settings, and pod resource consumption.

I decided to use stress-ng as my main tool, as it is capable of simulating many different types of resource consumption. stress-ng is very good at creating CPU, disk, and I/O pressure on a node. Unfortunately, it has a hard time creating memory pressure, as the tool operates on memory boundaries but does not reserve them with respect to the bounding cgroup, which Kubernetes uses to track memory consumption. In order to simulate memory pressure, I used some of the techniques mentioned on this Unix Stackexchange question. Specifically, the mechanism of continually increasing memory usage by running:

$ head -c 1000m /dev/zero | pv -L 10m | tail

This was an excellent little script as it allowed me to configure how much memory I wanted to consume in total (give me all the remaining memory!) while also slowly approaching a memory exhaustion condition that will require the system to choose some workloads to evict.

Finally, to create disk pressure conditions on the kubelet, I needed to create disk fill conditions on the volume that the kubelet was running in. This is generally located in /var/lib/kubelet. I used fallocate to create increasingly larger files to fill the volume containing that directory on the Kubernetes host.

At first, I wanted to be able to quickly cycle through all of the Kubernetes nodes, so I figured the Krew plugin node-shell would be a great way to quickly stand up pods without worrying about the SSH daemon on the remote host. The problem is that we are creating scheduling and eviction conditions on the node that we are trying to execute commands on, so the node-shell tool often becomes the first pod to get evicted. This is still a great tool for diagnosing node level issues when you have permissions to run a highly privileged container (hopefully not given to all users), but it is not great when the kubelet has to decide which process is responsible for the largest amount of resource consumption.

Disk Pressure

Let’s create some failures! We have the tools to exhibit resource exhaustion on the node and it’s time to see what happens. Fortunately, the kubelet provides great error messaging for such a situation. I created a large file in /var/lib/kubelet to fill disk space and very quickly thereafter got this log message:

worker-1 kubelet[53014]: eviction_manager.go:350]
"Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"

What happens after the eviction manager notices that the resource is exhausted on the node? It tries to reclaim the specific resource by removing things specifically targeted for that resource. In the case of disk space, it first attempts to garbage collect unused container images, as seen by the kubelet event:

20m  Warning   ImageGCFailed node/worker-1 
failed to garbage collect required amount of images. Wanted to free 8850644992 bytes, but freed 0 bytes

You will notice I executed this test on the kubeadm cluster. The default GKE cluster has some protection against these disk fill conditions by creating a read-only filesystem on the node. This means that it is a little more challenging to create these disk fill conditions. Rather than wrestle GKE, I had the GCE kubeadm cluster ready to go to easily fill up whatever disk I mounted.

It is important to be able to understand the symptoms that outline these node failure events so that you can catch them quickly and root out the problem. Kubelet logs and Kubernetes API events can provide a lot of useful information about what is happening in the cluster.

One of the key places you won’t see events immediately is in the Kubernetes pod logs. This is why it is so important to have multiple sources of data when operating Kubernetes in production. Let’s look at some additional messaging.

QoS Classes

From the kubelet logs:

worker-1 kubelet[53014]: eviction_manager.go:368] "Eviction manager: pods ranked for eviction" 
pods=[default/pdb-deployment-865c85586f-qdn6t default/pod-priority-high-deployment-7bf444b997-dr7g2 
default/guaranteed-deploy-585f7895f9-bqkjp 
kube-system/weave-net-mdd7x kube-system/kube-proxy-md68d]
worker-1 kubelet[53014]: eviction_manager.go:560] 
"Eviction manager: cannot evict a critical pod"

This gives us all the information we need to start figuring out what the eviction manager is doing for us. It tells us the result of its calculation, giving us an understanding of how it prioritizes these pods when the resource it is fighting for is not available to be reclaimed from the pods themselves.

In this situation, the pod disruption budget is not used, so the pdb-deployment pod, with a Burstable QoS and the default pod priority, is the first to be evicted. The high-priority Burstable pod comes next in eviction order. Although its priority is high, it is not able to outweigh the power of the Guaranteed QoS class. Next comes the Guaranteed QoS pod. Finally come the two kube-system daemonset pods that have been deployed on the node.

The second log line is also very interesting, because it tells us how the eviction operates when it hits a system pod, such as our network CNI. It does attempt to evict a daemonset, but it cannot evict a system critical pod. How are these pods delineated within the cluster?

System Priority Classes

Kubernetes clusters come with two built in pod priority classes:

NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            6d21h
system-node-critical      2000001000   false            6d21h

These priority classes are intended to be used only in situations where workloads are absolutely necessary for cluster or node operation. GKE, for example, uses ResourceQuotas to ensure that only pods in the kube-system namespace are marked with these values. In a kubeadm cluster, we can be a little more freewheeling. I repeated this experiment after changing the priority class of the high priority deployment to system-cluster-critical, and that pod then ranked after the Guaranteed pod in eviction order.
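The quota-based restriction could look roughly like the following. This is a sketch based on the upstream ResourceQuota priority-class scope, not GKE’s actual object: the quota name and the pod count are hypothetical, and it only restricts anything if the API server’s admission configuration limits these priority classes to namespaces that carry a matching quota.

```yaml
# Sketch: allow pods with system priority classes in kube-system.
# Name and pod count are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-pods
  namespace: kube-system
spec:
  hard:
    pods: "100"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
```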

Let’s look at what happens after the cluster evicts these pods. Later from the Kubernetes events:

17m  Warning  FailedScheduling pod/pdb-deployment-865c85586f-8crzh
0/3 nodes are available: 
1 node(s) didn't match pod affinity/anti-affinity rules, 
1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 
1 node(s) had taint {node.kubernetes.io/disk-pressure: }

Affinity Rules

This shows us what is happening on the cluster now that these pods have been evicted from the node. The scheduler attempts to reschedule the pod, but the anti-affinity rules prevent it from landing on the available node. These rules were added to the test to force the workloads away from each other, but this shows the real danger that a requiredDuringSchedulingIgnoredDuringExecution rule can pose when scheduling workloads. The second node would be available for these evicted workloads, and it has enough resources for the pod to land there, but we’ve forced the pods apart onto separate nodes and made the scheduler’s job more challenging.
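For comparison, a soft version of the same rule might look like the sketch below. It reuses the label value from the test configuration; the weight of 100 is an assumption. With this form the scheduler tries to spread the pods, but can still co-locate them when no other node is eligible.

```yaml
# Sketch: preferred (soft) anti-affinity instead of a hard requirement.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100            # illustrative weight, range is 1-100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - pod-priority-preempt
          topologyKey: "kubernetes.io/hostname"
```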

Memory Pressure

Next, I wanted to create memory pressure conditions on the node and force the kubelet to decide which pods to evict.

I used the tail method above to fill the memory on a GCE machine. In order to give myself some headroom and prevent the memory on the node from filling completely, I tuned the kubelet evictionHard parameters to start evicting pods when there were only 250Mi free on the node. These parameters are incredibly important to tune if you are managing Kubernetes clusters yourself, as they provide some headroom on the node in case of pending failure conditions. Read more about these conditions in the Kubernetes documentation.
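A kubelet configuration along these lines would express that threshold. This is a sketch, not the exact file from the test: only the 250Mi memory value comes from the experiment, and the other entries restate the documented Linux defaults so that setting one signal does not silently drop the rest. On a kubeadm node this file is typically /var/lib/kubelet/config.yaml, which is an assumption about the environment.

```yaml
# Sketch: hard eviction thresholds in the kubelet configuration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "250Mi"   # value used in this experiment
  nodefs.available: "10%"     # documented default
  nodefs.inodesFree: "5%"     # documented default
  imagefs.available: "15%"    # documented default
```

The kubelet has to be restarted to pick up a change to this file.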

This was when I really saw the downsides of using a pod to create these memory failure conditions. I tried using the nsenter pod first to create this failure and what I got instead was bumped out of the cluster before more interesting things could happen:

25m Warning   Evicted pod/nsenter-vgkgvs
The node was low on resource: memory.
Container nsenter was using 2005480Ki, which exceeds its request of 0.

This is where I ran into problems, as the kubelet can read the process tree and see that my node-shell pod was responsible for consuming all the free memory on the node! :facepalm: The Linux kernel also handles OOM killing of processes in the process tree once the machine runs out of resources, but if resource consumption originates from kubelet-managed processes, the eviction limits configured at the node level will begin killing processes before the kernel OOM killer steps in.

So, we hop into the node and start creating these failure conditions slowly. Using the pv command to increase memory consumption per second, we are able to slowly reach that point of eviction so that the kubelet can very clearly signal which pod gets bumped first:

$ kubectl get pods --watch
default  pdb-deployment-865c85586f-az4oj  0/1  Evicted
...small time passes
default  pod-priority-high-deployment-7bf444b997-xlz4n  0/1  Evicted
...small time passes
default  guaranteed-deploy-585f7895f9-me4q8  0/1  Evicted

Since the pods are not competing for the memory resource, they are once again evicted from the node in the same order. And again, since the pods are only given one spot to land, the scheduler drops them right back on worker-1:

default       guaranteed-deploy-585f7895f9-me4q8      0/1     Pending

This does no good and the three pods get stuck in a pending/evicted loop until the condition is resolved. Let’s look at some of the other conditions of memory pressure. From the kubelet events:

16m  Warning  OOMKilling  node/gke-resource-testing-scheduling-pool-08d0efc8-0cw5  
Out of memory: Killed process 5793 (nginx) total-vm:33080kB, anon-rss:1036kB, file-rss:0kB, shmem-rss:0kB, UID:101 pgtables:92kB oom_score_adj:994
16m  Warning  SystemOOM  node/gke-resource-testing-scheduling-pool-08d0efc8-0cw5 
System OOM encountered, victim process: nginx, pid: 5793

From our GKE cluster, we get some great information about what processes died and the resource consumption on the node that caused this problem. Again, the kubelet logs provide the same information about pod ordering, but I wanted to call out something interesting. The victim process doesn’t actually provide the pod name, but rather the process name and pid. This can be very challenging as it does not necessarily tie back to the specific pod that was killed. Your pod will show that it was evicted from the node, but you will not be able to tie a specific SystemOOM event to a pod based on the Kubernetes event that was published from the kubelet.

Churning CPU

This is one of the most fun parts of the experiment because I incorporated the pods into the resource consumption on the node.

Instead of using a base nginx image, I pulled down the containerstack/cpustress image and got the pods all running hot based on their reservation. Since the kubelet can allocate CPU cores based on resource limits, I set all the pods running at full throttle and let the kubelet manage their allocations:

containers:
- name: stress-critical
  image: containerstack/cpustress
  args:
  - --cpu
  - "1"

We can see that these containers are now limited by their configured resource limits:

NAME                                            CPU(cores)   MEMORY(bytes)
guaranteed-deploy-585f7895f9-prh9t              1000m        106Mi
guaranteed-deploy-585f7895f9-zrs82              999m         85Mi
pdb-deployment-865c85586f-grq78                 250m         107Mi
pdb-deployment-865c85586f-mkfgt                 250m         85Mi
pod-priority-high-deployment-7bf444b997-gvm4k   100m         44Mi
pod-priority-high-deployment-7bf444b997-s2c4j   101m         107Mi

Now, we hop into the node and consume the remainder of the two cores by running another stress-ng process via SSH on the node:

$ stress-ng --cpu=4
stress-ng: defaulting to a 86400 second run per stressor
stress-ng: dispatching hogs: 4 cpu

This was a very interesting test because the node was able to handle the processes running on the node. Many of the forked subprocesses from stress-ng got killed individually, but the node stayed alive without issue:

$ kubectl top nodes
worker-1       2001m        100%   4969Mi          84%

Even with it at 100% CPU, the kubelet kept reporting ready conditions:

$ kubectl describe node worker-1 | grep KubeletReady
Ready  True  KubeletReady  kubelet is posting ready status. AppArmor enabled

Probes for Production Deployments

What does this tell us? I think this really underscores the importance of readiness and liveness timeouts in our web application pods. This node stayed alive, and all the pods on it continued running, even though resource contention was incredibly high. A web application under that kind of load could not have served traffic in any reasonable time. Of course, we have to be careful about this kind of configuration as well, as losing all pods on a cluster due to readiness probe failures can create complete outages of its own. Spreading load out across nodes using preferredDuringSchedulingIgnoredDuringExecution can isolate those types of problems at an application level.
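As a hedged example of what such timeouts could look like on the nginx container used in this test, consider the fragment below. All of the timing values are assumptions chosen for illustration and would need tuning against real response times.

```yaml
# Sketch: probes with explicit timeouts. Timing values are illustrative.
containers:
- name: nginx
  image: nginx:1.14.2
  ports:
  - containerPort: 80
  # Remove the pod from Service endpoints if it cannot answer within 2s.
  readinessProbe:
    httpGet:
      path: /
      port: 80
    periodSeconds: 5
    timeoutSeconds: 2
    failureThreshold: 3
  # Restart the container only after a longer run of failures.
  livenessProbe:
    httpGet:
      path: /
      port: 80
    periodSeconds: 10
    timeoutSeconds: 2
    failureThreshold: 6
```

Keeping the liveness threshold looser than the readiness threshold means a starved pod stops receiving traffic well before it gets restarted.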

Unresponsive Kubelet

Now, we use the node-level hammer and kill the kubelet. This will create node unavailability, but how long will it be until the pods are rescheduled? About 30 seconds after killing the process, we start seeing signs of an outage:

36s Warning   NodeNotReady   pod/guaranteed-deploy-585f7895f9-zrs82      Node is not ready
36s Warning   NodeNotReady   pod/pdb-deployment-865c85586f-grq78         Node is not ready
36s Warning   NodeNotReady   pod/pod-priority-high-deployment-7bf444b997-gvm4k   Node is not ready
36s Normal    NodeNotReady   node/worker-1                               Node worker-1 status is now: NodeNotReady

Each pod gets a NodeNotReady event, and the node itself publishes the event to the default namespace. After this event, though, the pods themselves remain scheduled and even report as running on the node.

NAME                                            READY   STATUS    RESTARTS   AGE
guaranteed-deploy-585f7895f9-prh9t              1/1     Running   0          4h56m
guaranteed-deploy-585f7895f9-zrs82              1/1     Running   0          4h56m
pdb-deployment-865c85586f-grq78                 1/1     Running   0          4h56m
pdb-deployment-865c85586f-mkfgt                 1/1     Running   0          4h56m
pod-priority-high-deployment-7bf444b997-gvm4k   1/1     Running   0          4h56m
pod-priority-high-deployment-7bf444b997-s2c4j   1/1     Running   0          4h56m

This is due to the unavailability timeout setting in the Kubernetes API server. By default, pods are left on an unavailable node for 5 minutes before Kubernetes attempts to reschedule them. This is also configurable at the pod level by setting a timed toleration for the unreachable or not-ready taint. We can see this taint set on the node now that it is unavailable:

Taints: node.kubernetes.io/unreachable:NoExecute
        node.kubernetes.io/unreachable:NoSchedule

The NoExecute taint starts a timer in the Kubernetes controller manager component. Once the 5 minutes have elapsed, the NoExecuteTaintManager starts moving the pods around:

event.go:291] "Event occurred" object="worker-1" kind="Node" apiVersion="v1" type="Normal" 
reason="NodeNotReady" message="Node worker-1 status is now: NodeNotReady"
taint_manager.go:106] "NoExecuteTaintManager is deleting pod" pod="default/pdb-deployment-865c85586f-grq78"

The workloads attempt to reschedule but once again we have forced the hand of kube-scheduler. These pods can’t land anywhere due to the rules set on the Kubernetes nodes:

49s  Warning   FailedScheduling       pod/pod-priority-high-deployment-7bf444b997-8g6fz
0/3 nodes are available:
1 node(s) didn't match pod affinity/anti-affinity rules,
1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate,
1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.

Now, let’s resolve the condition by restarting the kubelet and see that the node is posting ready status once again:

Ready  True    KubeletReady   kubelet is posting ready status. AppArmor enabled

Conclusions

So, after all those Kubernetes logs and events, what have we learned? I take the following lessons away from this experiment:

Guaranteed QoS class provides a significant level of protection against node level resource consumption. Using this for the most critical workloads may cost me some space on a node, but it will be worth it to prevent noise issues on these important workloads.
Use custom Pod Priorities for normal workloads. You should be able to make decisions about which are the most important pods in the application given problems occurring on the node. The kubelet attempts to resolve node level conditions respecting these values, so a couple different levels may help. Beyond two or three levels of priority, it becomes tedious to manage and far more work than it is worth.
What parts of the system should get critical status? Is your logging daemonset system critical? What about a cloud identity provider integration daemonset? Use the system level priority classes for workloads that would ruin all workloads on the cluster if an outage occurred.
Understand your likely failure modes. These vary based on the applications that are being deployed on the cluster. Do you expect disk consumption, memory pressure, or potential CPU consumption issues? These failures can manifest themselves in different ways and the recovery process for each of them depends on eviction constraints and the cluster configuration.
Well-tuned requests & limits help with memory exhaustion. The kubelet is trying to figure out which workloads are “misbehaving” the most, so use those ranges to tell it what appropriate behavior looks like.
Probe failures will not relocate pods. Readiness/liveness/startup probes are helpful when there are application problems, not when there is a system problem. The exception to this is when a deployment update is being rolled out to a node in an inconsistent state. These types of probes, especially startup probes, can help fail a deployment rollout if the node is in a bad state and force redeployment.

That’s all I have! I hope that you were able to learn something from this experiment. I’d love to hear from anyone who has tried similar experiments in other configurations or has any ideas about how to improve what was tested here. Feel free to reach out at us @superorbital.io. If you’d like to work with us and have fun with people deep in the Kubernetes ecosystem, email join-us@superorbital.io!