The Kubernetes scheduler is the brain of the cluster, deciding which node runs each Pod. This post is part of a series where I explore advanced scheduling mechanisms in Kubernetes. In this part, I focus on the current state of the default scheduler.

I walk through how the default scheduler works under the hood, breaking it down into queueing, filtering, scoring, binding, and preemption. I explain how Pods move through different queues, how the scheduler picks viable nodes, and how it scores and selects the best one. I also touch on newer features like QueueingHint and PreEnqueue plugins, and discuss how the scheduler balances performance and fairness at scale. If you’re curious about how Kubernetes makes scheduling decisions, this post offers a detailed look at the mechanisms behind it.

Table of Contents#

  1. (You’re here) Under the hood
  2. The scheduling framework
  3. The scheduler-plugins project
  4. Community schedulers in the ecosystem
  5. Future trends
  %%{
    init: {
      'logLevel': 'debug', 'theme': 'default',
      'themeVariables': {
        'git0': '#ff0000',
        'git1': '#00ff00'
      },
      'gitGraph': {'showBranches': true, 'showCommitLabel': true}
    }
  }%%

gitGraph:
	checkout main
	commit
	branch scheduling_series
	checkout scheduling_series
	commit
	checkout main
    commit
	merge scheduling_series tag:"under the hood" type: HIGHLIGHT
	commit
	checkout scheduling_series
	commit
	commit
	checkout main
	commit
	merge scheduling_series tag:"scheduling framework"
	checkout scheduling_series
	commit
	commit
	checkout main
	merge scheduling_series tag:"scheduler-plugins"
	checkout scheduling_series
	commit
	commit
	checkout main
	merge scheduling_series tag:"community schedulers"
	checkout scheduling_series
	commit
	commit
	checkout main
	merge scheduling_series tag:"future trends"
	commit
    

Kubernetes Scheduler#

The Kubernetes default scheduler¹ (kube-scheduler) is a control-plane component that watches for unscheduled Pods and assigns each to an optimal Node. Scheduling a Pod involves two main phases: a scheduling cycle (selecting a suitable node) and a binding cycle (confirming the assignment).

At a high level, the scheduler’s workflow can be broken down into queueing, filtering, scoring, and binding:

  • Queueing: New Pods without a Node are added to a priority queue². The scheduler pulls Pods from the ActiveQ to process. Since Kubernetes 1.26³, Pods enter the ActiveQ only if the PreEnqueue plugins determine that they are ready for scheduling; Pods that are gated by scheduling gates are placed in the UnschedulableQ instead. The UnschedulableQ holds Pods that have been unschedulable for a while and need to be retried later; they can be moved to the podBackoffQ to retry scheduling when something relevant changes, for example the removal of their scheduling gates or an event triggered by a new node joining the cluster. Kubernetes 1.32 promoted a recent feature, QueueingHint, to beta: a per-plugin callback that decides whether an event might make a Pod schedulable and, if so, requeues it promptly to the ActiveQ. This design balances fairness (no Pod waits forever) with honoring priority levels. By default, Pods are ordered by their .spec.priority (higher priority first) and, within the same priority, by FIFO order, though this is configurable via a QueueSort plugin (see the queue-ordering sketch after this list). This ensures important Pods get scheduled sooner while preventing starvation of lower-priority Pods.

  • Filtering (Predicates): In the filtering phase, the scheduler finds all feasible nodes for the Pod. It evaluates each candidate node against a set of predicate checks, now implemented as Filter plugins. Filters check Node taints and Pod tolerations, affinity/anti-affinity rules, volume topology, and so on. These predicate functions are essentially boolean conditions – a node must pass all filters to be considered viable. If no node passes filtering, the Pod cannot be scheduled and remains pending.

  • Scoring (Priorities): After filtering, multiple nodes are usually still eligible. The scoring phase ranks the feasible nodes to choose the most suitable one. The scheduler calculates a score for each node by running various scoring functions (now Score plugins). These scores (typically 0–100) assess how well a node satisfies soft preferences. For example, one scoring plugin favors spreading Pods across nodes for availability, while another (ImageLocality) gives a higher score to nodes that already have the Pod’s container image, to reduce pull time. Each plugin contributes to a node’s score, and Kubernetes sums and normalizes them to pick the highest-scoring node⁴. If multiple nodes tie for the top score, the scheduler selects one at random to break the tie, which avoids bias and improves fairness. A sketch after this list walks through the full filter, score, and select flow.

  • Binding: Once the scheduler selects the best Node, it issues a Bind action – essentially writing the decision back to the API server by setting the Pod’s .spec.nodeName to that Node. At this point, the Pod is considered scheduled and the kubelet on the target node will eventually start the Pod’s containers. The binding step is a critical section – Kubernetes uses optimistic concurrency and retries to handle the case where the cluster state changed (e.g., another pod took the resources). In normal operation, this all happens quickly and transparently.

  • Preemption (if needed): An additional step, preemption⁵, comes into play when no feasible node is found for a Pod but lower-priority Pods are occupying resources it could use. Kubernetes will attempt to evict (preempt) some lower-priority Pods to free resources for a high-priority Pod that is pending. Preemption is only considered for Pods with a PriorityClass, and it respects certain policies: it won’t evict Pods with equal or higher priority, and it tries to minimize the number of victims. This mechanism helps honor quality-of-service and SLA expectations for critical workloads: if a vital Pod cannot schedule because other workloads have saturated resources, the scheduler may remove less important Pods to make room. Preemption is triggered in the PostFilter phase of the scheduling cycle (after failing to find a node), ensuring that the scheduler tries all normal scheduling means first. A rough victim-selection sketch follows this list.
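
To make the queue ordering from the Queueing step concrete, here is a minimal, self-contained Go sketch. The queuedPod type and less function are illustrative stand-ins, not the scheduler’s internal types; the default QueueSort plugin (PrioritySort) applies the same comparison to its queued-Pod records.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a stand-in for the scheduler's internal queued-Pod bookkeeping.
type queuedPod struct {
	name     string
	priority int32     // corresponds to .spec.priority
	added    time.Time // when the Pod entered the queue
}

// less mirrors the default ordering: higher priority first, FIFO within a priority.
func less(a, b queuedPod) bool {
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.added.Before(b.added)
}

func main() {
	now := time.Now()
	activeQ := []queuedPod{
		{"batch-job", 0, now.Add(-3 * time.Second)},
		{"ingress-critical", 1000, now.Add(-1 * time.Second)},
		{"web-frontend", 0, now.Add(-2 * time.Second)},
	}
	sort.Slice(activeQ, func(i, j int) bool { return less(activeQ[i], activeQ[j]) })
	for _, p := range activeQ {
		fmt.Println(p.name) // ingress-critical, batch-job, web-frontend
	}
}
```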
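
The next sketch strings the filtering, scoring, and selection steps together end to end. The node attributes, the single hard predicate, and the scoring weights are all invented for illustration – the real scheduler runs its registered Filter and Score plugins – but the shape of the flow is the same: hard checks discard nodes, soft preferences rank the survivors, and a tie at the top is broken at random.

```go
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	name     string
	freeCPU  int  // millicores still available
	hasImage bool // does the node already have the Pod's image?
	tainted  bool // stands in for any failing hard predicate
}

type pod struct {
	name   string
	cpuReq int // requested millicores
}

// filter keeps only nodes that pass every hard predicate (boolean checks).
func filter(p pod, nodes []node) []node {
	var feasible []node
	for _, n := range nodes {
		if n.tainted || n.freeCPU < p.cpuReq {
			continue
		}
		feasible = append(feasible, n)
	}
	return feasible
}

// score rates a feasible node against soft preferences.
func score(n node) int {
	s := 0
	if n.hasImage {
		s += 100 // analogous to the ImageLocality preference
	}
	s += n.freeCPU / 100 // prefer nodes with more headroom
	return s
}

// selectNode returns the highest-scoring node, breaking ties at random.
func selectNode(feasible []node) node {
	best, bestScore := []node{}, -1
	for _, n := range feasible {
		switch s := score(n); {
		case s > bestScore:
			best, bestScore = []node{n}, s
		case s == bestScore:
			best = append(best, n)
		}
	}
	return best[rand.Intn(len(best))]
}

func main() {
	nodes := []node{
		{"node-a", 4000, false, false},
		{"node-b", 2000, true, false},
		{"node-c", 8000, true, true}, // filtered out: fails a hard predicate
	}
	p := pod{name: "web-frontend", cpuReq: 1500}
	chosen := selectNode(filter(p, nodes))
	fmt.Printf("bind %s -> %s\n", p.name, chosen.name)
}
```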
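
Finally, a rough sketch of the victim-selection idea behind preemption: consider only Pods with strictly lower priority than the pending Pod and evict the lowest-priority ones until enough resources would be freed. The real PostFilter logic is far more careful (it also weighs PodDisruptionBudgets, graceful termination, and nominated nodes), so this is only an illustration of the core rule.

```go
package main

import (
	"fmt"
	"sort"
)

type runningPod struct {
	name     string
	priority int32
	cpu      int // millicores currently in use
}

// pickVictims returns the Pods to evict on one node so that `needed` millicores
// become free, or nil if preempting on this node cannot help.
func pickVictims(preemptorPriority int32, needed int, running []runningPod) []runningPod {
	// Only strictly lower-priority Pods are candidates; try the lowest priority first.
	var candidates []runningPod
	for _, p := range running {
		if p.priority < preemptorPriority {
			candidates = append(candidates, p)
		}
	}
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].priority < candidates[j].priority })

	freed := 0
	var victims []runningPod
	for _, p := range candidates {
		if freed >= needed {
			break
		}
		victims = append(victims, p)
		freed += p.cpu
	}
	if freed < needed {
		return nil // even evicting every candidate would not make room
	}
	return victims
}

func main() {
	onNode := []runningPod{
		{"batch-1", 0, 1000},
		{"batch-2", 0, 1500},
		{"api-proxy", 2000, 500}, // higher priority than the preemptor: never a victim
	}
	// A pending Pod with priority 1000 needs 2000m on this node.
	fmt.Println(pickVictims(1000, 2000, onNode)) // [{batch-1 0 1000} {batch-2 0 1500}]
}
```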

The default scheduling algorithm must balance latency, throughput, and accuracy of placement in large clusters. Kubernetes scheduling is an NP-hard⁶ optimization problem, but the scheduler is designed to work efficiently even with thousands of nodes and pods. For scalability, the scheduler doesn’t examine every node for every Pod – when the cluster is very large, it samples only a percentage of nodes to evaluate. This percentageOfNodesToScore parameter helps cap the work per scheduling cycle, trading a bit of optimality for speed at scale. In practice, this means that in a 5,000-node cluster the scheduler might score only 10% of nodes (500 nodes) for each Pod, dramatically reducing scheduling latency with minimal impact on placement quality.
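
The sketch below roughly mirrors that sampling heuristic for the case where percentageOfNodesToScore is not set: the percentage shrinks as the cluster grows, with a floor on both the percentage and the absolute node count. The exact constants vary by release, so treat the numbers as illustrative rather than a contract.

```go
package main

import "fmt"

// numNodesToScore caps how many nodes the scheduler evaluates for one Pod.
// A configuredPercentage of 0 means "not set", which triggers the adaptive default.
func numNodesToScore(totalNodes, configuredPercentage int) int {
	const minNodes = 100 // small clusters are always scored in full
	if totalNodes < minNodes || configuredPercentage >= 100 {
		return totalNodes
	}
	pct := configuredPercentage
	if pct == 0 {
		pct = 50 - totalNodes/125 // shrink the percentage as the cluster grows
		if pct < 5 {
			pct = 5
		}
	}
	if n := totalNodes * pct / 100; n >= minNodes {
		return n
	}
	return minNodes
}

func main() {
	fmt.Println(numNodesToScore(5000, 0)) // 500 - the 10% example from the text
	fmt.Println(numNodesToScore(800, 0))  // 352 (44%)
}
```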

The scheduler’s throughput (pods scheduled per second) becomes important at scale. Kubernetes can schedule dozens of pods per second under typical conditions, but heavy custom plugins or extenders can slow it down. Ensuring fairness in scheduling is also critical in multi-tenant clusters. Kubernetes relies on the priority queue and optional Pod quotas to prevent any one workload from hogging all resources. For example, the scheduler by default sorts Pods by priority, so a burst of low-priority Pods won’t delay a high-priority Pod. At the same time, the random tie-breaking and round-robin binding of equivalent pods helps distribute opportunities. Advanced policies like Capacity Scheduling can enforce fairness more explicitly across teams or queues. In summary, the default scheduler aims to be fast, reasonably fair, and “good enough” in placement for general workloads, while allowing extension for specialized needs.

%%{init: { "theme": "redux-dark", "flowchart": { "htmlLabels": true, "curve": "linear", "defaultRenderer": "elk" } } }%%
flowchart TB
subgraph Scheduling_Cycle["Scheduling Cycle"]
	Q(["ActiveQ"])
	n1["PreEnqueue"]
	F["Filter: discard infeasible nodes"]
	X["PostFilter: Preemption?"]
	S["Score: rank feasible nodes"]
	H["Select highest score node"]
	R["Reserve node"]
	n2(["UnschedulableQ"])
	n3(["PodBackOffQ"])
	n5["QueueSort"]
end
subgraph Binding_Cycle["Binding Cycle"]
	B["Bind Pod to Node"]
end
n1 --> Q & n2
Q --> n5
n5 --> F
F -- no nodes --> X
X -- "evict low-priority pod" --> Q
F --> S
S --> H
H --> R
R --> B
P["New Pod"] --> n1
X -- no eviction --> n2
n2 -- external events or QueueingHint --> n3
n3 --> Q
    

References#