Kubernetes Scheduling: the scheduling framework
The Kubernetes scheduler is the brain of the cluster, deciding which node runs each Pod. This is the second post in a series where I explore advanced scheduling mechanisms in Kubernetes. In this one, I give an overview of the current state of the Kubernetes scheduling framework. I explain how Kubernetes scheduling works as a batch-oriented process that handles one Pod at a time. I walk through the evolution from the older predicates and priorities model to the modern Scheduling Framework, where each step in the scheduling cycle is an extension point for plugins. I also cover extenders, PreEnqueue plugins, and SchedulingGates, which enable more flexible and complex scheduling workflows. Finally, I highlight projects like Kueue and the Multiarch Tuning Operator that build on these features to support AI, HPC, and multi-architecture workloads.
Table of Contents
- Under the hood
- (You’re here) The scheduling framework
- The scheduler-plugins Project
- Community schedulers in the ecosystem
- Future trends
```mermaid
%%{ init: { 'logLevel': 'debug', 'theme': 'default',
    'themeVariables': { 'git0': '#ff0000', 'git1': '#00ff00' },
    'gitGraph': { 'showBranches': true, 'showCommitLabel': true, 'mainBranchName': 'scheduling_series' } } }%%
gitGraph
  checkout main
  commit
  branch scheduling_series
  checkout scheduling_series
  commit
  checkout main
  commit
  merge scheduling_series tag: "under the hood"
  commit
  checkout scheduling_series
  commit
  commit
  checkout main
  commit
  merge scheduling_series tag: "scheduling framework" type: HIGHLIGHT
  checkout scheduling_series
  commit
  commit
  checkout main
  merge scheduling_series tag: "scheduler-plugins"
  commit
  commit
  checkout main
  merge scheduling_series tag: "community schedulers"
  checkout scheduling_series
  commit
  commit
  checkout main
  merge scheduling_series tag: "future trends"
  commit
```
Kubernetes Scheduling Framework
Kubernetes scheduling is batch-oriented (one Pod at a time per scheduler thread). It does not globally optimize across all Pods in a single pass; instead, it makes a series of local decisions. This can occasionally lead to suboptimal placements (e.g., a Pod might take the “last slot” on a node that a later high-priority Pod could have used). Efforts like Pod preemption and the variety of scoring heuristics try to mitigate this. Some advanced schedulers take a more global or intelligent approach to placement.
In older Kubernetes versions, the scheduling logic was described in terms of Predicates (filtering functions) and Priorities (scoring functions). Administrators could even configure custom scheduling policies by enabling or disabling certain predicates/priorities. Modern Kubernetes has evolved this into the Scheduling Framework, where each stage (filter, score, bind, etc.) is an extension point for plugins. However, the core ideas remain the same. The default scheduler still uses a set of built-in filters (e.g. node resource checks, taint tolerations) and scoring rules (e.g. spread, resource balance), along with a deterministic workflow of phases.
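To make the extension points concrete, here is a minimal sketch of a Filter plugin written against the in-tree framework package (k8s.io/kubernetes/pkg/scheduler/framework). The ZoneFilter name, the example.com/zone label, and the annotation key are invented for illustration, and the exact signatures can differ slightly between Kubernetes versions:

```go
package zonefilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ZoneFilter is a hypothetical Filter plugin: it rejects nodes that do not
// carry the zone label requested by the Pod's annotations.
type ZoneFilter struct{}

var _ framework.FilterPlugin = &ZoneFilter{}

// Name identifies the plugin so it can be enabled per extension point in the
// scheduler configuration.
func (z *ZoneFilter) Name() string { return "ZoneFilter" }

// Filter runs once per (Pod, Node) pair during the Filter extension point.
// Returning an Unschedulable status removes the node from the candidate list.
func (z *ZoneFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	wanted, ok := pod.Annotations["example.com/zone"]
	if !ok {
		return nil // the Pod did not ask for a zone: every node passes
	}
	if nodeInfo.Node().Labels["example.com/zone"] != wanted {
		return framework.NewStatus(framework.Unschedulable, "node is not in the requested zone")
	}
	return nil
}
```

An out-of-tree plugin like this is typically compiled into a custom scheduler binary and then enabled or disabled per extension point in the plugins section of a KubeSchedulerConfiguration, alongside the built-in filters and scorers.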
Another concept is extenders – a pre-framework way to run external scheduling logic via webhook calls. Extenders allow an external service to filter or score nodes in addition to the default process. While extenders are still supported, they have drawbacks: the extra latency of HTTP calls, operational complexity, and a limited set of extension points. Kubernetes 1.19+ (where the Scheduling Framework became stable) covers most extender use cases with in-process plugins, which are more efficient.
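For comparison, an extender is just an HTTP service that speaks the JSON contract defined in k8s.io/kube-scheduler/extender/v1. The sketch below shows what a handler for the filter verb could look like; the /filter path and the port are placeholders that would be wired up in the extenders section of the scheduler configuration:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filterHandler implements the "filter" verb of a scheduler extender. The
// scheduler POSTs ExtenderArgs (the Pod plus the candidate nodes) and expects
// an ExtenderFilterResult that keeps or rejects each node.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	result := &extenderv1.ExtenderFilterResult{
		// A real extender would prune this list; here every candidate is kept.
		Nodes:       args.Nodes,
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filterHandler)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```

Every scheduling cycle that reaches the extender pays for this round trip, which is exactly the latency cost mentioned above.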
More recently, the introduction of PreEnqueue plugins and the SchedulingGates feature allows for more complex scheduling logic. PreEnqueue plugins determine whether a Pod is ready to be scheduled at all, while SchedulingGates block scheduling until certain conditions are met (e.g., waiting for a node to be ready). SchedulingGates can only be added at Pod creation time, either manually or automatically through mutating webhooks, and the scheduler will not schedule a Pod until all of its gates are removed.
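A PreEnqueue plugin only sees the Pod (no nodes) and decides whether it may enter the active scheduling queue at all. Here is a rough sketch under the same caveats as before; the QuotaGate name and the example.com/admitted label are assumptions made up for this example:

```go
package quotagate

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// QuotaGate is a hypothetical PreEnqueue plugin: Pods enter the active queue
// only once an external admission step has labeled them as admitted.
type QuotaGate struct{}

var _ framework.PreEnqueuePlugin = &QuotaGate{}

func (q *QuotaGate) Name() string { return "QuotaGate" }

// PreEnqueue runs before the Pod is moved into the active queue. A
// non-success status keeps the Pod parked until something about it changes.
func (q *QuotaGate) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	if p.Labels["example.com/admitted"] != "true" {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "pod has not been admitted yet")
	}
	return nil
}
```

Incidentally, the built-in SchedulingGates support is itself implemented as a PreEnqueue plugin that parks any Pod whose spec.schedulingGates list is non-empty.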
External controllers are responsible for removing the gates once their conditions are met. This enables more complex scheduling workflows, such as holding a Pod back until quota is available or an external system has admitted it.
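As a sketch of that flow, the snippet below uses client-go to build a gated Pod and to strip the gate again from a controller once its condition holds; the gate name example.com/wait-for-quota, the container image, and the condition itself are assumptions for illustration:

```go
package gates

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gateName is an illustrative gate; real controllers use their own well-known names.
const gateName = "example.com/wait-for-quota"

// newGatedPod builds a Pod that carries the gate at creation time; the
// scheduler reports it as SchedulingGated until the gate is removed.
func newGatedPod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{{Name: gateName}},
			Containers:      []corev1.Container{{Name: "app", Image: "registry.example.com/app:latest"}},
		},
	}
}

// ungatePod is what an external controller would do once its condition is
// satisfied: drop the gate so the scheduler can finally consider the Pod.
// Gates can only be removed after creation, never added.
func ungatePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	var kept []corev1.PodSchedulingGate
	for _, gate := range pod.Spec.SchedulingGates {
		if gate.Name != gateName {
			kept = append(kept, gate)
		}
	}
	pod.Spec.SchedulingGates = kept

	_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```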
Operators like Kueue and OpenShift's Multiarch Tuning Operator build on this feature. Kueue is a Kubernetes-native job queueing system that lets users define queues and scheduling policies for HPC/AI workloads. The Multiarch Tuning Operator lets users define scheduling policies for multi-architecture clusters, ensuring that workloads are scheduled on nodes of an appropriate architecture.