Kubernetes Scheduling: the scheduling framework
The Kubernetes scheduler is the brain of the cluster, deciding which node runs each Pod. This is the second post in a series where I explore advanced scheduling mechanisms in Kubernetes. In this one, I give an overview of the current state of the Kubernetes scheduling framework. I explain how Kubernetes scheduling works as a batch-oriented process that handles one Pod at a time. I walk through the evolution from the older predicates and priorities model to the modern Scheduling Framework, where each step in the scheduling cycle is an extension point for plugins. I also cover extenders, PreEnqueue plugins, and SchedulingGates, which enable more flexible and complex scheduling workflows. Finally, I highlight projects like Kueue and the Multiarch Tuning Operator that build on these features to support AI, HPC, and multi-architecture workloads.
Table of Contents
- Under the hood
- (You’re here) The scheduling framework
- The scheduler-plugins Project
- Community schedulers in the ecosystem
- Future trends
```mermaid
%%{ init: { 'logLevel': 'debug', 'theme': 'default',
    'themeVariables': { 'git0': '#ff0000', 'git1': '#00ff00' },
    'gitGraph': { 'showBranches': true, 'showCommitLabel': true, 'mainBranchName': 'scheduling_series' } } }%%
gitGraph
  checkout main
  commit
  branch scheduling_series
  checkout scheduling_series
  commit
  checkout main
  commit
  merge scheduling_series tag: "under the hood"
  commit
  checkout scheduling_series
  commit
  commit
  checkout main
  commit
  merge scheduling_series tag: "scheduling framework" type: HIGHLIGHT
  checkout scheduling_series
  commit
  commit
  checkout main
  merge scheduling_series tag: "scheduler-plugins"
  commit
  commit
  checkout main
  merge scheduling_series tag: "community schedulers"
  checkout scheduling_series
  commit
  commit
  checkout main
  merge scheduling_series tag: "future trends"
  commit
```
Kubernetes Scheduling Framework
Kubernetes scheduling is batch-oriented (one Pod at a time per scheduler thread). It does not globally optimize across all Pods in a single pass; instead, it makes a series of local decisions. This can occasionally lead to suboptimal placements (e.g., a Pod might take the “last slot” on a node that a later high-priority Pod could have used). Efforts like Pod preemption and the variety of scoring heuristics try to mitigate this. Some advanced schedulers take a more global or intelligent approach to placement.
In older Kubernetes versions, the scheduling logic was described in terms of Predicates (filtering functions) and Priorities (scoring functions). Administrators could even configure custom scheduling policies by enabling or disabling certain predicates/priorities. Modern Kubernetes has evolved this into the Scheduling Framework, where each stage (filter, score, bind, etc.) is an extension point for plugins. However, the core ideas remain the same. The default scheduler still uses a set of built-in filters (e.g. node resource checks, taint tolerations) and scoring rules (e.g. spread, resource balance), along with a deterministic workflow of phases.
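To make the extension points concrete, here is a minimal sketch of a Filter plugin written against the in-tree framework package (k8s.io/kubernetes/pkg/scheduler/framework). The ZoneFilter name, the example.com/zone label, and the annotation key are invented for illustration, and the exact signatures can differ slightly between Kubernetes versions:

```go
package zonefilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ZoneFilter is a hypothetical Filter plugin: it rejects nodes that do not
// carry the zone label requested by the Pod's annotations.
type ZoneFilter struct{}

var _ framework.FilterPlugin = &ZoneFilter{}

// Name identifies the plugin so it can be enabled per extension point in the
// scheduler configuration.
func (z *ZoneFilter) Name() string { return "ZoneFilter" }

// Filter runs once per (Pod, Node) pair during the Filter extension point.
// Returning an Unschedulable status removes the node from the candidate list.
func (z *ZoneFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	wanted, ok := pod.Annotations["example.com/zone"]
	if !ok {
		return nil // the Pod did not ask for a zone: every node passes
	}
	if nodeInfo.Node().Labels["example.com/zone"] != wanted {
		return framework.NewStatus(framework.Unschedulable, "node is not in the requested zone")
	}
	return nil
}
```

An out-of-tree plugin like this is typically compiled into a custom scheduler binary and then enabled or disabled per extension point in the plugins section of a KubeSchedulerConfiguration, alongside the built-in filters and scorers.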
Another concept is extenders – a pre-framework way to run external scheduling logic via webhook calls. Extenders allow an external service to filter or score nodes in addition to the default process. While extenders are still supported, they have drawbacks: the extra latency of HTTP calls, operational complexity, and a limited set of extension points. Kubernetes 1.19+ (where the Scheduling Framework became stable) covers most extender use cases with in-process plugins, which are more efficient.
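For comparison, an extender is just an HTTP service that speaks the JSON contract defined in k8s.io/kube-scheduler/extender/v1. The sketch below shows what a handler for the filter verb could look like; the /filter path and the port are placeholders that would be wired up in the extenders section of the scheduler configuration:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filterHandler implements the "filter" verb of a scheduler extender. The
// scheduler POSTs ExtenderArgs (the Pod plus the candidate nodes) and expects
// an ExtenderFilterResult that keeps or rejects each node.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	result := &extenderv1.ExtenderFilterResult{
		// A real extender would prune this list; here every candidate is kept.
		Nodes:       args.Nodes,
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filterHandler)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```

Every scheduling cycle that reaches the extender pays for this round trip, which is exactly the latency cost mentioned above.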
More recently, the introduction of PreEnqueue plugins and the SchedulingGates feature allows for more complex scheduling logic. PreEnqueue plugins determine whether a Pod is ready to be scheduled at all, while SchedulingGates block scheduling until certain conditions are met (e.g., waiting for a node to be ready). SchedulingGates can only be added at Pod creation time, either manually or automatically through mutating webhooks, and the scheduler will not schedule a Pod until all of its gates are removed.
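A PreEnqueue plugin only sees the Pod (no nodes) and decides whether it may enter the active scheduling queue at all. Here is a rough sketch under the same caveats as before; the QuotaGate name and the example.com/admitted label are assumptions made up for this example:

```go
package quotagate

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// QuotaGate is a hypothetical PreEnqueue plugin: Pods enter the active queue
// only once an external admission step has labeled them as admitted.
type QuotaGate struct{}

var _ framework.PreEnqueuePlugin = &QuotaGate{}

func (q *QuotaGate) Name() string { return "QuotaGate" }

// PreEnqueue runs before the Pod is moved into the active queue. A
// non-success status keeps the Pod parked until something about it changes.
func (q *QuotaGate) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	if p.Labels["example.com/admitted"] != "true" {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "pod has not been admitted yet")
	}
	return nil
}
```

Incidentally, the built-in SchedulingGates support is itself implemented as a PreEnqueue plugin that parks any Pod whose spec.schedulingGates list is non-empty.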
External controllers are responsible for removing the gates once their conditions are met. This enables more complex scheduling workflows, such as holding a Pod back until quota is available or an external system has admitted it.
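As a sketch of that flow, the snippet below uses client-go to build a gated Pod and to strip the gate again from a controller once its condition holds; the gate name example.com/wait-for-quota, the container image, and the condition itself are assumptions for illustration:

```go
package gates

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gateName is an illustrative gate; real controllers use their own well-known names.
const gateName = "example.com/wait-for-quota"

// newGatedPod builds a Pod that carries the gate at creation time; the
// scheduler reports it as SchedulingGated until the gate is removed.
func newGatedPod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{{Name: gateName}},
			Containers:      []corev1.Container{{Name: "app", Image: "registry.example.com/app:latest"}},
		},
	}
}

// ungatePod is what an external controller would do once its condition is
// satisfied: drop the gate so the scheduler can finally consider the Pod.
// Gates can only be removed after creation, never added.
func ungatePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	var kept []corev1.PodSchedulingGate
	for _, gate := range pod.Spec.SchedulingGates {
		if gate.Name != gateName {
			kept = append(kept, gate)
		}
	}
	pod.Spec.SchedulingGates = kept

	_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```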
Operators like Kueue and OpenShift's Multiarch Tuning Operator build on this feature. Kueue is a Kubernetes-native job queueing system that lets users define queues and scheduling policies for HPC/AI workloads. The Multiarch Tuning Operator lets users define scheduling policies for multi-architecture clusters, ensuring that workloads are scheduled on nodes of an appropriate architecture.