Resources in Kubernetes
Resources in Kubernetes are CPU time and RAM that a node allocates for a specific pod or container. We control how containers utilize cluster resources via requests and limits.
This article focuses on requests — and specifically on what Kubernetes actually does with that value at every step, from the moment you apply a manifest to the moment a pod is running on a node.
Resource Types
There are 4 resource types:
cpu— CPU time, measured in millicoresmemory— RAM, measured in bytesstorage— a specific resource type for PVCsextended resources— user-defined resources, maintained by administrators (e.g. GPU usage)
In this article we focus on the first two.
What requests Actually Is
requests is the minimum amount of resources (cpu or memory) that Kubernetes guarantees for a pod or container.
That word guarantees deserves unpacking. It has two distinct meanings depending on where in the system you look:
- At the scheduler level —
requestsis the value the scheduler uses to decide where the pod can run and which node is the best fit. - At the node level —
requestsis translated into OS-level primitives that actually enforce the guarantee at runtime.
Both levels are covered in this article — the scheduler first, then the full chain from kubelet admission through to kernel enforcement.
How the Scheduler Uses requests
When you apply a manifest, the API server extracts the requests values and hands them to the scheduler. The scheduler's job is to find the best node. It does this in two phases: Filtering and Scoring.
Phase 1 — Filtering
Filtering answers the question: "Where can this pod actually run?"
The scheduler checks every available node against the pod's requirements. If a node fails any requirement, it's removed from the list. Requirements include things like taints & tolerations, node affinity, port availability — but the most relevant one here is Resource Fit.
For each node, the scheduler looks at already-allocated resources:
for node in $(kubectl get nodes -o name); do
echo "=== $node ==="
kubectl describe $node | grep -A 5 "Allocated resources"
echo
done
=== node/vm-cluster-node-3 ===
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3 (37%) 4500m (56%)
memory 5376Mi (33%) 5504Mi (34%)
=== node/vm-cluster-node-2 ===
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1600m (20%) 600m (7%)
memory 1519904Ki (9%) 1284384Ki (7%)
=== node/vm-cluster-node-1 ===
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 300m (3%) 0 (0%)
memory 5260Mi (32%) 8490Mi (53%)
The scheduler checks two conditions:
free memory on node >= requests.memory
free CPU on node >= requests.cpu
If either fails, the node is removed. If no node passes — the pod gets stuck in Pending state.
Filtering is binary: pass or fail. It produces a list of viable nodes, not a ranked list. That's the job of the next phase.
Phase 2 — Scoring
Scoring answers the question: "Which viable node is the best fit?"
If only one node survived filtering, there's nothing to score — that node wins. If multiple nodes survived, the scheduler runs them through a set of scoring plugins. Each plugin scores each node from 0 to 10. The scores are combined and the node with the highest total wins.
Two plugins are central to how requests affects scoring.
Plugin 1 — NodeResourcesFit (LeastAllocated) — source
The default strategy here is LeastAllocated — the scheduler prefers nodes that have the most free resources after placing the pod. The intuition is simple: spread the load, don't pile everything onto one node.
The formula for each resource:
$$ S_{cpu} = \frac{C_{cpu} - R_{cpu}}{C_{cpu}} \times 10 $$
$$ S_{mem} = \frac{C_{mem} - R_{mem}}{C_{mem}} \times 10 $$
Where $R$ is total requested resources on the node including the current pod, and $C$ is total allocatable capacity. The final score is the average:
$$ S_{LeastAllocated} = \frac{S_{cpu} + S_{mem}}{2} $$
A node that is 20% utilized scores much higher than one at 80%. This is the primary force pushing pods toward emptier nodes.
An alternative strategy — MostAllocated — does the opposite, preferring to fill nodes as much as possible before using new ones. This is useful when you want to consolidate workloads and free up nodes for scale-down. It's not the default, but worth knowing it exists.
Plugin 2 — NodeResourcesBalancedAllocation — source
LeastAllocated cares about how much is used. BalancedAllocation cares about the ratio between CPU and memory usage. Even a lightly loaded node can score poorly here if it has, say, 80% of CPU allocated but only 10% of memory — because that imbalance leaves the other resource stranded and unusable for most workloads.
The math:
Step 1: Compute resource fractions (projected utilization after placing the pod)
$$ F_{cpu} = \min\left(1.0,\ \frac{R_{cpu}}{C_{cpu}}\right), \quad F_{mem} = \min\left(1.0,\ \frac{R_{mem}}{C_{mem}}\right) $$
Step 2: Compute imbalance
$$ D = \left| F_{cpu} - F_{mem} \right| $$
- $D = 0$ → perfectly balanced (e.g. 50% CPU & 50% RAM)
- $D = 1$ → completely imbalanced (e.g. 100% CPU & 0% RAM)
Step 3: Score
$$ S_{raw} = (1 - D) \times 100 $$
Or fully expanded:
$$ S_{raw} = \left(1 - \left| \frac{R_{cpu}}{C_{cpu}} - \frac{R_{mem}}{C_{mem}} \right|\right) \times 100 $$
The Scheduler Framework then normalizes all plugin scores to 0–10 and combines them. The node with the highest combined score is selected for binding.
How the two plugins work together
LeastAllocated and BalancedAllocation pull in the same general direction but optimize for different things. The first pushes pods toward empty nodes. The second ensures that whichever node is chosen, its CPU and memory are utilized proportionally. Together they describe a scheduler that is trying to keep the cluster both spread and efficient.
You can tune the weight of each resource within BalancedAllocation via KubeSchedulerConfiguration:
pluginConfig:
- name: NodeResourcesBalancedAllocation
args:
resources:
- name: cpu
weight: 1
- name: memory
weight: 2 # memory imbalance penalized 2x more
Other Scoring Plugins
Resource-based scoring is the dominant logic when requests are in play, but the scheduler runs several other plugins in the same pass. A few worth knowing:
DefaultPodTopologySpread / Pod Spreading — source — prefers nodes with fewer pods of the same type already running. Improves fault tolerance by avoiding concentration. Can tip the decision when resource scores between nodes are close.
ImageLocality — source — prefers nodes that already have the container image cached locally. A node with the full image scores higher than one that would need to pull it. Not about resources at all, but directly affects startup latency and can be surprisingly influential for large images.
InterPodAffinity — source — scores nodes based on affinity and anti-affinity rules between pods. Important to be aware of because it can override resource-based scores when rules are explicitly set — a node that looks worse on resources may win if affinity rules strongly favor it.
TaintToleration — source — has both a filtering role (hard taints block scheduling) and a scoring role. Nodes with fewer unmatched taints score higher, subtly steering pods away from nodes that weren't intended for general workloads.
None of these are about requests values directly, but they all participate in the same final score. The node the scheduler picks is the winner across all of these plugins combined — not just the one with the most free resources. The default weights for every plugin are defined in default_plugins.go — worth reading to see exactly what the scheduler prioritizes out of the box.
What the Scheduler Guarantees — and What It Doesn't
After scoring, the API server sends a binding command to kubelet on the winning node, and the pod starts. At this point the scheduler's job is done.
It's worth being precise about what the scheduler's requests check actually guarantees: it guarantees that at scheduling time, enough resources were declared available on the node. It does not guarantee that those resources will be available at runtime if other pods on the node are consuming more than their declared requests.
That enforcement — the actual runtime guarantee — happens at the OS level, via cgroups for CPU and the OOM killer for memory. That's the subject of the next section.
What Happens On the Node When a Pod Is Scheduled
Before diving into cgroups, let's look at the full picture of what happens on the node — from kubelet receiving the binding to a process actually running in an isolated environment.
kubectl apply
│
▼
API Server stores PodSpec
│
▼
Scheduler: Filtering → Scoring → writes Binding to API Server
# Described in previous section
│
▼
Kubelet watches API Server, sees pod assigned to its node
│
├─► Kubelet admission checks (including Resource Fit)
│
├─► Creates pod-level cgroup hierarchy
│ writes cpu.weight + memory.min ← requests values land here
│
▼
Kubelet → gRPC (protobuf) → containerd
│
├─► RunPodSandbox:
│ containerd creates container-level cgroup for sandbox
│ containerd → shim → runc → pause container running
│ network namespace created, held open by pause
│
├─► CreateContainer + StartContainer (per app container):
│ containerd translates CRI spec → OCI spec (config.json)
│ containerd → shim → runc
│ runc creates pid, mnt, uts namespaces
│ runc sets up overlayfs rootfs
│ runc writes PID → cgroup.procs ← process joins cgroup
│ runc calls clone() + execve() ← process is running
│
▼
Linux kernel enforces:
cpu.weight → CFS/EEVDF scheduling weight under contention
memory.min → memory reclaim protection
As requests travels through this chain it is translated several times — from a manifest value, to a protobuf field, to an OCI spec entry, to a cgroupfs file. At each layer it means something slightly different but consistent. Let's look at each step in detail.
Step 1 — Kubelet Admission Checks
Kubelet doesn't trust the scheduler blindly. Before doing any work, it runs its own admission chain — a set of checks against the node's current, real-time state. This is necessary because the scheduler works from a cached view of the cluster. Between the moment the scheduler decided "this node has enough room" and the moment kubelet receives the binding, other pods may have been scheduled onto the same node and the cache may be stale.
The admission chain (source):
| Check | What it validates | Failure result |
|---|---|---|
| Predicate checks | NodeAffinity, Taints, Tolerations — same logic as scheduler filtering, re-run locally | Pod rejected, event FailedScheduling |
| Volume checks | PVCs available, secrets and ConfigMaps exist and are mountable | Pod waits with status ContainerCreating |
| Image pull | Registry access, image existence | ErrImagePull / ImagePullBackOff |
| Security context | RunAsRoot, capabilities, SELinux labels | Rejected by PodSecurity admission |
| Resource Fit | requests and limits against node's actual allocatable resources |
Pod status OutOfcpu or OutOfmemory, back to pending |
The Resource Fit check is the most important one for our topic. Kubelet computes:
sum of requests (all currently running pods)
+ requests of the new pod
≤ node allocatable
For both CPU and memory independently. Note that node allocatable is not the same as total node capacity. Kubelet subtracts its own reserved resources and the eviction threshold before doing the check:
allocatable = capacity - kube-reserved - system-reserved - eviction-threshold
This means a node may appear to have free capacity from the outside while kubelet correctly rejects new pods because the safety margins are already consumed.
If Resource Fit fails, the pod is not rejected permanently — it goes back to Pending and the scheduler will attempt to place it again, possibly on a different node. Only after all admission checks pass does kubelet proceed to the next step.
Step 2 — Kubelet Creates the cgroup Hierarchy
With admission passed, kubelet creates the cgroup tree for the pod before calling the container runtime. This is an important ordering: the resource envelope exists at the kernel level before any process is spawned inside it.
The hierarchy kubelet creates:
/sys/fs/cgroup/
└── kubepods/ ← root for all Kubernetes workloads
└── burstable/ ← QoS class (covered in the QoS article)
└── pod<uid>/ ← pod-level cgroup ← kubelet creates this
cpu.weight ← sum of container requests
memory.min ← sum of container requests
The container-level cgroups inside the pod directory are created later by containerd. Kubelet owns the outer boundary; the runtime owns the inner ones.
cgroupv1 vs cgroupv2: The parameter name for CPU weight changed between versions. In cgroupv1, CPU priority is set via
cpu.shares— an arbitrary unit where the default is 1024, and values are compared relatively between siblings. In cgroupv2, this was replaced bycpu.weight— a cleaner scale of 1 to 10000, default 100, with the same relative semantics. Kubernetes >= 1.25 supports cgroupv2 properly. The mapping kubelet uses:cpu.weight = 1 + ((millicores - 1) * 9999) / 255000. The concept is identical in both versions — relative share under contention — only the scale and file name differ.
Step 3 — Kubelet Calls the CRI
Kubelet communicates with containerd via gRPC using protobuf — not JSON or YAML. The interface is defined in api.proto.
It sends two types of calls:
RunPodSandbox — sets up the shared pod environment. containerd creates the network namespace, starts the pause container (which does nothing except hold the namespaces open for the pod's lifetime), and creates the sandbox-level cgroup. The pause container is a real container — containerd spawns a shim and calls runc for it just like any other container.
CreateContainer + StartContainer — one pair per app container. containerd receives the CRI spec in protobuf format and translates it into an OCI runtime spec — a config.json file written to a bundle directory on disk. This is the moment where Kubernetes concepts disappear and everything becomes pure Linux primitives:
"linux": {
"resources": {
"unified": {
"cpu.weight": "100",
"memory.min": "67108864", ← from requests
"memory.max": "134217728", ← from limits
"cpu.max": "50000 100000" ← from limits
}
},
"cgroupsPath": "kubepods-burstable-pod<uid>.slice:containerd:<container-id>"
}
Step 4 — containerd-shim + runc
containerd spawns a containerd-shim process for each container — a lightweight process that stays alive for the container's lifetime and manages I/O, exit codes, and signals. The shim then calls runc, which does the actual low-level work:
- Creates the remaining namespaces:
pid,mnt,uts(the network namespace was already created for the sandbox by the pause container step) - Sets up the overlayfs rootfs from the image layers
- Creates the container-level cgroup directory under the pod cgroup
- Writes the new PID into
cgroup.procs— this is the moment the process joins the cgroup, beforeexecve()is called - Calls
clone()+execve()— the process is now running inside all namespaces, under the cgroup
The ordering of the last two steps matters: the process is inside the cgroup before it becomes the application. There is no moment where it runs outside its resource envelope.
What the Kernel Enforces
Once the process is running, enforcement is entirely at the kernel level — Kubernetes is no longer involved.
cpu.weight → CFS/EEVDF scheduling weight
The kernel scheduler uses weights as relative shares under contention. If the node is idle, a container can use far more CPU than its request value — the weight only matters when multiple processes compete for CPU time. This is why requests do not cap CPU usage. A request of 100m does not mean the container is limited to 100 millicores — it means the container is guaranteed at least that share when the node is busy. Capping is the job of limits, enforced via cpu.max.
memory.min → memory reclaim protection
memory.min instructs the kernel's memory reclaim system: do not reclaim pages from this cgroup under memory pressure. When the node is running low on memory and the kernel needs to free pages, it skips cgroups with memory.min set and targets those without it first. This is the actual runtime enforcement of the requests guarantee — not a Kubernetes mechanism, a direct kernel memory management instruction.
oom_score_adj → OOM kill priority
memory.min protects against reclaim — the gradual process of freeing unused pages. But if memory pressure becomes critical and reclaim isn't enough, the kernel's OOM killer steps in and terminates a process outright. oom_score_adj is the knob that determines which process gets killed first.
The value ranges from -1000 to +1000. A higher score means the process is a more likely kill target. Kubelet sets this per container based on QoS class — which is determined by how requests and limits are configured:
| QoS class | oom_score_adj |
Meaning |
|---|---|---|
Guaranteed |
-997 | Nearly unkillable — requests == limits for all containers |
Burstable |
2 to 999 (scaled by memory request) | Killed before Guaranteed, after BestEffort |
BestEffort |
1000 | Killed first — no requests or limits set at all |
The exact value for Burstable pods is calculated as: 1000 - (1000 * memoryRequest / machineMemoryCapacity). A pod requesting more memory gets a lower (safer) score — the kernel treats a larger memory commitment as a signal that the pod is more important.
This is why requests affect OOM behavior even indirectly: the QoS class is derived from the relationship between requests and limits, and the QoS class determines oom_score_adj. The full QoS classification logic is covered in the next article.