What This Article Is About
This is the third article in the series. The first article
covered requests — how the scheduler uses them to place pods, and how they become
kernel primitives via cgroups and OOM scoring. The second article
covered limits — why limits are invisible to the scheduler, how they become hard
kernel ceilings, and what happens when a container hits them.
Both articles ended at the same place: oom_score_adj. The first mentioned it in
passing as a consequence of QoS class. The second showed that hitting memory.max
triggers an in-cgroup OOM kill, and noted that chronic memory pressure eventually
leads to CrashLoopBackOff. Neither article explained the mechanism that connects
QoS classification to kernel kill priority — or what happens when the pressure
isn't coming from inside a single container, but from the node as a whole.
That's what this article is about.
A Short Recap
Resources in Kubernetes are CPU time and RAM allocated to a pod or container,
controlled via requests and limits. There are four resource types:
- cpu — CPU time, measured in millicores
- memory — RAM, measured in bytes
- storage — used for PersistentVolumeClaims (PVCs)
- extended resources — user-defined resources such as GPUs
This article focuses on cpu and memory, specifically on what happens when
their requests and limits values are combined — or absent.
What Quality of Service Is
Quality of Service (QoS) in Kubernetes is a classification system. Kubernetes
looks at how requests and limits are configured for every container in a pod
and assigns the pod one of three classes: Guaranteed, Burstable, or
BestEffort.
The classification has two effects. The primary one is kill priority under node memory pressure: when the node runs low on memory and needs to reclaim it, QoS class determines which pods are evicted first. The secondary effect is CPU scheduling weight: QoS class influences how the kernel distributes CPU time under contention.
Neither effect is visible during normal operation. QoS only matters when something is wrong.
How Classes Are Assigned
The classification rules are applied per-pod, based on the combined configuration of all containers in the pod (including init containers).
Guaranteed
A pod is Guaranteed if every container — without exception — has both
requests and limits set for CPU and memory, and the values are equal:
resources:
requests:
cpu: "500m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "128Mi"
If any single container in the pod omits a request, omits a limit, or has
requests != limits for either resource, the pod is not Guaranteed.
There is one shortcut: if you set only limits and omit requests, Kubernetes
automatically sets requests equal to limits. This means a pod with only
limits defined can still qualify as Guaranteed — as long as limits are set
for all containers on both CPU and memory.
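You can confirm the shortcut directly. A hypothetical pod with only limits set
still comes out Guaranteed (the pod name and image here are placeholders):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: limits-only
spec:
  containers:
  - name: app
    image: nginx
    resources:
      limits:
        cpu: "500m"
        memory: "128Mi"
EOF
kubectl get pod limits-only -o jsonpath='{.status.qosClass}'
Guaranteed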
BestEffort
A pod is BestEffort if no container has any requests or limits set at all:
resources: {}
Any resource configuration — even a single requests.cpu on a single
container — disqualifies the pod from BestEffort and moves it to Burstable.
Burstable
Everything else is Burstable. This is the most common class in practice. A
pod is Burstable if it doesn't qualify as Guaranteed and isn't
BestEffort — meaning at least one container has at least one requests or
limits value set, but the pod as a whole doesn't meet the strict
requests == limits requirement across all containers and resources.
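A typical Burstable configuration, with requests set below limits (values are
illustrative):
resources:
  requests:
    cpu: "250m"
    memory: "64Mi"
  limits:
    cpu: "1"
    memory: "256Mi"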
Multi-Container Pods
For pods with multiple containers, the classification considers all of them
together. A pod with three containers where two are Guaranteed-equivalent and
one is BestEffort-equivalent is classified as Burstable — the weakest
configuration of any container determines the pod's class.
You can check the assigned class on any running pod:
$ kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
Guaranteed
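To see the class for every pod in the cluster at once:
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass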
Two Distinct OOM Scenarios
Before looking at how QoS translates into kernel values, it's important to distinguish two different situations where OOM kills happen. The previous article covered the first. This article is primarily about the second.
Scenario 1: In-cgroup OOM kill. A single container exceeds its own
memory.max value. The kernel's OOM killer fires scoped to that container's
cgroup and kills a process inside it. This is a local event — it doesn't
involve other pods, doesn't consult QoS class, and doesn't use
oom_score_adj for target selection across the node. The target is the
process inside that cgroup with the highest OOM score. QoS class is
irrelevant here because the boundary is already set by memory.max.
Scenario 2: Node-level memory pressure. The node as a whole is running
low on memory. memory.max values haven't been hit — the problem is that
the sum of all container memory usage is approaching the node's physical
limit. Kubelet detects this condition, sets the MemoryPressure condition
on the node, and begins evicting pods, ranking them by QoS class and by
how far their usage exceeds their requests. The kernel's node-level OOM
killer can also fire independently if the situation becomes critical
before kubelet acts; it selects victims using oom_score_adj, which
kubelet set based on QoS class. Both paths converge on the same kill
order: BestEffort first, Guaranteed last.
QoS class is only meaningful in Scenario 2.
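Evictions from Scenario 2 leave a visible trace: kubelet records a pod event
with reason Evicted. A quick way to list them (assuming the events haven't
aged out of the default retention window yet):
kubectl get events -A --field-selector reason=Evicted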
What Triggers Node Memory Pressure
Kubelet continuously monitors node memory usage. When available memory drops
below a configurable eviction threshold (default 100Mi, but tunable via
--eviction-hard), kubelet sets the MemoryPressure condition on the node
and begins evicting pods.
kubectl describe node <node-name> | grep -A 5 "Conditions"
Conditions:
Type Status
MemoryPressure True ← node is under memory pressure
DiskPressure False
PIDPressure False
Ready True
When MemoryPressure is True, kubelet stops admitting new BestEffort
pods to this node and begins selecting existing pods for eviction. The
selection is based on QoS class, then on how far the pod's memory usage
exceeds its request.
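The threshold itself is configurable. A sketch of raising the hard eviction
threshold (the 200Mi value is purely illustrative):
kubelet --eviction-hard='memory.available<200Mi'
# or the equivalent field in the KubeletConfiguration file:
# evictionHard:
#   memory.available: "200Mi"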
The kernel's node-level OOM killer operates independently of kubelet. If
memory is exhausted before kubelet's eviction loop runs — which can happen
under sudden memory spikes — the kernel fires first and selects victims using
oom_score_adj directly. Kubelet eviction is the intended mechanism.
The kernel OOM killer is the safety net.
oom_score_adj — The Kernel Side of QoS
oom_score_adj is a per-process Linux kernel parameter that adjusts a
process's likelihood of being selected by the OOM killer. It ranges from
-1000 to +1000. A higher value makes the process a more likely kill target.
A value of -1000 makes a process completely immune to OOM kills.
Kubelet sets oom_score_adj for each container's main process based on
the pod's QoS class:
| QoS class | oom_score_adj | Kill priority |
|---|---|---|
| Guaranteed | -997 | Last — nearly immune |
| Burstable | 2 to 999 | Middle — depends on memory request |
| BestEffort | 1000 | First — always the primary target |
The value for Burstable pods is not fixed. It is calculated per-pod based
on the pod's memory request relative to total node memory:
$$ \text{oom score adj} = \min\left( \max\left( 2,\ 1000 - \frac{1000 \cdot \text{memoryRequestBytes}}{\text{machineMemoryCapacityBytes}} \right),\ 999 \right) $$
The result is rounded up: a raw value of 934.12 becomes 935 (kubelet computes
the fraction with integer division, which rounds the final score upward).
The consequence of this formula: a Burstable pod that requests more memory
gets a lower (safer) oom_score_adj. A pod requesting 12 GiB on a 16 GiB node
gets a score of 250; a pod requesting 100 MiB on the same node gets 994,
nearly as exposed as BestEffort. Only a request approaching the node's full
memory capacity pushes the score down toward the floor of 2.
This is intentional. The kernel treats a larger memory commitment as a signal that the pod is more important to keep alive. A pod that declared a large request is expected to actually need that memory; a pod that declared a tiny request is more likely to be safely killed without major consequence.
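To make the numbers concrete, here is a small shell sketch that mirrors the
formula above (integer arithmetic, clamped to [2, 999] — a sketch, not
kubelet's actual Go code):
# Reproduce the Burstable oom_score_adj calculation
score() {
  local req_bytes=$1 cap_bytes=$2
  local adj=$(( 1000 - (1000 * req_bytes) / cap_bytes ))
  (( adj < 2 ))   && adj=2
  (( adj > 999 )) && adj=999
  echo "$adj"
}

gib=$(( 1024 * 1024 * 1024 ))
score $(( 12 * gib )) $(( 16 * gib ))            # 12Gi request on a 16Gi node → 250
score $(( 100 * 1024 * 1024 )) $(( 16 * gib ))   # 100Mi request on the same node → 994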
You can verify the value on a running pod. From inside the container, PID 1
is the container's main process:
kubectl exec <pod-name> -- cat /proc/1/oom_score_adj
461
Or on the node, once you've found the host PID of the process:
cat /proc/<pid>/oom_score_adj
The Pause Container
Each pod has a pause container — a minimal process that holds the pod's
network namespace open for the lifetime of the pod. If the pause process is
killed, the network namespace is destroyed. All containers in the pod lose
their network connectivity and are effectively killed regardless of their
own OOM scores.
For this reason, kubelet assigns the pause container oom_score_adj = -998 —
one point safer than Guaranteed app containers at -997. This ensures the
pause process is the absolute last thing the OOM killer targets within a pod,
even if the pod itself is Guaranteed.
The pause container does not consume meaningful memory in normal operation (typically under 1 MB), so its presence at -998 does not affect node-level OOM decisions in practice. It is a protection measure, not a scheduling factor.
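You can spot-check this on a node. Assuming the sandbox binary is named
pause, as in the standard registry.k8s.io/pause image:
# List pause processes and their oom_score_adj values (expect -998)
for pid in $(pgrep -x pause); do
  printf '%-8s %s\n' "$pid" "$(cat /proc/"$pid"/oom_score_adj)"
done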
The Full Chain: From QoS Class to Kernel
Here is how the QoS classification connects to kernel enforcement, following the same chain structure as the previous articles:
kubectl apply
│
▼
API Server stores PodSpec
│
▼
Scheduler places pod on node
# QoS class not used for scheduling — only requests matter here
│
▼
Kubelet computes QoS class from PodSpec
│
├─► Classifies pod: Guaranteed / Burstable / BestEffort
│ based on requests and limits across all containers
│
├─► Creates pod-level cgroup under the matching QoS parent:
│     Guaranteed:  /sys/fs/cgroup/kubepods/pod<uid>/
│     Burstable:   /sys/fs/cgroup/kubepods/burstable/pod<uid>/
│     BestEffort:  /sys/fs/cgroup/kubepods/besteffort/pod<uid>/
│
├─► Calculates oom_score_adj per container:
│ Guaranteed → -997
│ BestEffort → 1000
│ Burstable → 2 to 999 (formula above)
│
▼
Kubelet → gRPC → containerd → runc
│
├─► runc writes container PID to cgroup.procs
│ (process is now under the cgroup hierarchy)
│
├─► runc (or kubelet via /proc) writes oom_score_adj to:
│ /proc/<pid>/oom_score_adj
│
▼
Linux kernel:
cgroup location → determines OOM kill scope and priority
oom_score_adj → adjusts kill likelihood within scope
memory.min → protects against reclaim (from requests)
memory.max → hard ceiling, in-cgroup OOM (from limits)
The cgroup directory itself encodes the QoS class. The kernel doesn't know
about Kubernetes QoS classes — it only knows about oom_score_adj values and
cgroup boundaries. The mapping from class to value is done entirely by kubelet
before the process starts.
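A quick way to confirm where a given process landed is to read its cgroup
membership from /proc (the exact path shape varies with the cgroup driver;
this shows the cgroupfs layout on cgroup v2):
cat /proc/<pid>/cgroup
0::/kubepods/burstable/pod<uid>/<container-id>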
QoS and the cgroup Hierarchy
The QoS classes map onto the cgroup hierarchy under /sys/fs/cgroup/kubepods/.
Burstable and BestEffort pods get dedicated subtrees; Guaranteed pods are
placed directly under the kubepods root:
/sys/fs/cgroup/kubepods/
├── pod<uid>/                ← Guaranteed
│   └── <container-id>/
├── burstable/
│   └── pod<uid>/
│       └── <container-id>/
└── besteffort/
    └── pod<uid>/
        └── <container-id>/
This placement is meaningful beyond organization. The kernel's memory reclaim
and OOM kill logic operates on the cgroup hierarchy. When the node is under
pressure, the kernel can target entire cgroup subtrees. BestEffort pods, all
grouped under besteffort/, are structurally separated from Guaranteed pods
sitting at the kubepods root. The hierarchy makes the kill ordering easier
to enforce at the kernel level.
You can inspect this on any node:
ls /sys/fs/cgroup/kubepods/burstable/ | head -5
pod3f2a1b4c-...
pod7e9d0c12-...
pod1a4f8e23-...
QoS and CPU
CPU QoS doesn't have a kill mechanism. There is no "CPU OOM kill" — a
container that exceeds its CPU request is throttled (if it also has a limit)
or competes for time (if it doesn't). QoS class affects CPU behavior through
a different channel: cpu.weight in the cgroup hierarchy.
When kubelet creates the pod-level cgroup, it sets cpu.weight based on the
sum of CPU requests across the pod's containers. The QoS class subtree itself
also has a cpu.weight that reflects the aggregate weight of all pods in that
class.
The practical effect: under CPU contention, processes in higher-priority
cgroups (with higher cpu.weight) get proportionally more CPU time. A
Guaranteed pod with cpu.weight=10000 (the maximum) will receive far more
CPU time than a BestEffort pod during a contention event.
The BestEffort subtree gets the minimum weight (cpu.shares=2 on cgroup v1,
which maps to cpu.weight=1 on cgroup v2), which means BestEffort pods are
the first to be starved of CPU time when the node is busy, in the same way
they are the first to be killed when the node is low on memory. The behavior
is consistent across both resources.
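For intuition, the arithmetic works roughly like this: kubelet converts the
CPU request into cgroup-v1-style shares, and runc maps those shares onto the
cgroup v2 weight scale. A sketch with an illustrative 500m request (the
constants match the conversion formulas used by kubelet and runc, but treat
this as an approximation, not the exact code path):
# CPU request (millicores) → cpu.shares (kubelet) → cpu.weight (runc, cgroup v2)
millicores=500
shares=$(( millicores * 1024 / 1000 ))           # 512
weight=$(( 1 + (shares - 2) * 9999 / 262142 ))   # 20
echo "cpu.weight=$weight"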
You can verify the weight for any container:
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container-id>/cpu.weight
100
What You Should Take Away
The three QoS classes are not a resource management feature — they are a failure management feature. They only matter when something is going wrong on the node, and they determine who pays the price.
A few practical points worth keeping in mind:
Pods without any requests or limits are BestEffort and will be killed
first. This is often unintentional — it happens when teams omit resource
configuration entirely. Check with:
kubectl get pods -A -o json | \
jq '.items[] | select(.status.qosClass == "BestEffort") |
.metadata.namespace + "/" + .metadata.name'
Guaranteed pods are the safest but also the least efficient — a pod with
requests == limits can never burst. On a lightly loaded node, a
Guaranteed pod with cpu: 500m will never use more than 500 millicores
even if the node has plenty of idle capacity.
Burstable is the right class for most production workloads — set
requests accurately to reflect normal usage, set limits to cap runaway
consumption, and accept that the pod can be evicted under extreme node
pressure. The eviction order within Burstable is determined by how far
above its memory request the pod is currently running.
What's Next
The next article, the final part of this series, covers the ResourceQuota
and LimitRange objects in Kubernetes.