Kubernetes Scheduling: the scheduler-plugins project
The Kubernetes scheduler is the brain of the cluster: it decides which node runs each Pod. This is the third post in a series exploring advanced scheduling mechanisms in Kubernetes. In this one, I focus on the scheduler-plugins project from SIG Scheduling. I explain how it extends the Kubernetes Scheduling Framework with a collection of out-of-tree plugins that enable advanced behaviors like gang scheduling, NUMA-aware placement, load-aware scoring, and more. I walk through key plugins such as Capacity Scheduling, Coscheduling, Trimaran, and Network-Aware Scheduling, and show how they solve real-world scheduling problems. I also cover how to run these plugins in your cluster, either by replacing the default scheduler with a custom build or by deploying a secondary scheduler alongside it, and discuss the tradeoffs of each approach.
Table of Contents#
- Under the hood
- The scheduling framework
- (You’re here) The scheduler-plugins Project
- Community schedulers in the ecosystem
- Future trends
```mermaid
%%{ init: { 'logLevel': 'debug', 'theme': 'default',
            'themeVariables': { 'git0': '#ff0000', 'git1': '#00ff00' },
            'gitGraph': { 'showBranches': true, 'showCommitLabel': true } } }%%
gitGraph
    checkout main
    commit
    branch scheduling_series
    checkout scheduling_series
    commit
    checkout main
    commit
    merge scheduling_series tag: "under the hood"
    commit
    checkout scheduling_series
    commit
    commit
    checkout main
    commit
    merge scheduling_series tag: "scheduling framework"
    checkout scheduling_series
    commit
    commit
    checkout main
    merge scheduling_series tag: "scheduler-plugins" type: HIGHLIGHT
    checkout scheduling_series
    commit
    commit
    checkout main
    merge scheduling_series tag: "community schedulers"
    checkout scheduling_series
    commit
    commit
    checkout main
    merge scheduling_series tag: "future trends"
    commit
```
The Kubernetes scheduler-plugins Project#
As Kubernetes deployments grew in scale and diversity, the need arose to customize scheduling for specific scenarios (batch jobs, real-time workloads, multi-tenant fairness, etc.). While the Scheduling Framework makes it possible to write custom plugins, building and maintaining them in production is non-trivial. Kubernetes SIG Scheduling therefore created the kubernetes-sigs/scheduler-plugins project to provide a collection of out-of-tree scheduler plugins based on the official framework1.
The scheduler-plugins repository is an official SIG Scheduling sub-project that houses a set of plugins developed and used by large companies and the community. These plugins implement advanced scheduling capabilities that go beyond the default scheduler’s heuristics. The idea is that cluster operators can take these plugins and easily build a custom scheduler (or integrate them into the existing scheduler) without starting from scratch. The project provides pre-built container images and even Helm charts to deploy a scheduler with these plugins enabled. In essence, it’s an extension library for Kubernetes scheduling.
Some of the notable plugins in this repository include:
Capacity Scheduling: Allows sharing the cluster capacity among multiple groups or tenants with fairness. It introduces the concept of ElasticQuota, so that each team (or queue) gets a guaranteed quota but can borrow unused resources from others – enabling higher overall utilization while maintaining fairness. This is inspired by Hadoop YARN’s Capacity Scheduler.
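To make this concrete, here is a sketch of what an ElasticQuota object looks like. I'm using the scheduling.x-k8s.io/v1alpha1 API group and the min/max fields from the project's examples; the exact schema can vary between releases, and the namespace and quota names here are made up.

```yaml
# Sketch of an ElasticQuota for a hypothetical "team-a" namespace.
# "min" is the guaranteed share; "max" caps how much idle capacity
# the team may borrow from other quotas.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  min:
    cpu: "10"
    memory: 20Gi
  max:
    cpu: "20"
    memory: 40Gi
```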
Coscheduling: Implements gang scheduling for Pods that are parts of a collective job. Coscheduling ensures that a set of Pods (belonging to a PodGroup) are scheduled together or not at all. This is important for MPI, Spark, or AI jobs that require multiple Pods to start simultaneously to be effective. If not all Pods in the group can schedule, the scheduler will hold off (or keep them pending) until the gang can be formed, thus meeting the job’s SLA for all-or-nothing placement.
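Here is a sketch of how a gang is declared: a PodGroup with a minMember threshold, plus a label on each member Pod pointing at it. The API group, label key, field names, and scheduler name below follow recent scheduler-plugins releases but have changed over time, so double-check them against the version you install.

```yaml
# Sketch: a PodGroup that requires at least 4 Pods to be schedulable together.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: spark-job
  namespace: batch
spec:
  minMember: 4                 # all-or-nothing threshold
  scheduleTimeoutSeconds: 60   # how long to wait before giving up on the gang
---
# Member Pods opt in via a label referencing the PodGroup (label key varies by release).
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker-0
  namespace: batch
  labels:
    scheduling.x-k8s.io/pod-group: spark-job
spec:
  schedulerName: scheduler-plugins-scheduler   # assumed name of the scheduler running the plugin
  containers:
    - name: worker
      image: example.com/spark-worker:latest   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```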
Node Resources and Node Resource Topology: These plugins improve awareness of node hardware details. For example, Node Resource Topology enables NUMA-aware scheduling – ensuring that a Pod with high CPU needs lands on a single NUMA node for better performance (reducing cross-NUMA memory access). It can also consider things like GPU topology or specialized device placement.
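For NUMA alignment to kick in, the Pod itself has to be pinnable, which in practice means Guaranteed QoS with whole-CPU requests. Below is a minimal sketch; the plugin additionally relies on NodeResourceTopology objects exported by an agent on each node, which I'm not showing here, and the image name is a placeholder.

```yaml
# Sketch: a Guaranteed QoS Pod (requests == limits, integer CPUs) that a
# NUMA-aware scheduler can place so its CPUs and memory sit on one NUMA node.
apiVersion: v1
kind: Pod
metadata:
  name: numa-sensitive-app
spec:
  containers:
    - name: app
      image: registry.example.com/latency-sensitive:1.0   # placeholder image
      resources:
        requests:
          cpu: "4"        # whole CPUs so the CPU manager can pin them
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
```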
Preemption Toleration: This plugin refines the preemption behavior. It can be used to mark certain Pods or namespaces as “tolerant” to preemption, meaning they won’t trigger preemption or won’t be preempted under certain conditions. This is useful to prevent churn in specific scenarios or to implement graceful degradation policies.
Trimaran (Load-Aware Scheduling): Trimaran2 is a set of plugins that make the scheduler load-aware, meaning decisions consider actual runtime metrics (CPU utilization, etc.) in addition to just requested resources. For instance, a node might have CPU requests free, but if the node’s current CPU usage is very high (due to other pods bursting), a load-aware scheduler would avoid placing a new Pod there. Trimaran plugins use a load-watcher service to fetch metrics from sources like Metrics Server or Prometheus. By packing Pods based on real utilization gaps, this can improve efficiency and avoid “noisy neighbor” issues. This was developed from research by IBM and others.
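Wiring Trimaran up means enabling one of its plugins as a score plugin and pointing it at the load-watcher. The sketch below uses TargetLoadPacking; the argument names, the load-watcher address, and the scheduler name follow the project's documentation but are assumptions to verify against the release you deploy.

```yaml
# Sketch: a scheduler profile that scores nodes using real utilization
# via the TargetLoadPacking (Trimaran) plugin.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: scheduler-plugins-scheduler
    plugins:
      score:
        enabled:
          - name: TargetLoadPacking
    pluginConfig:
      - name: TargetLoadPacking
        args:
          watcherAddress: http://load-watcher.kube-system.svc:2020   # assumed load-watcher endpoint
          targetUtilization: 50   # pack nodes up to ~50% observed CPU usage
```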
Network-Aware Scheduling: The network-aware scheduling plugin3 factors in network topology and traffic costs. It might, for example, prefer scheduling Pods in the same availability zone or network segment to minimize latency or cross-zone data transfer costs. For network-intensive workloads, this can significantly improve performance and reduce cloud egress costs4.
Integration and usage: To use these plugins, you have two main options. One is to run a custom scheduler binary with the plugins compiled in (the project publishes a scheduler image that bundles them) as the cluster’s scheduler. The other is to run it as a secondary scheduler alongside the default: you deploy a second scheduler in your cluster with a different name, and it handles only the Pods that set spec.schedulerName to that name.
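The opt-in side is just one field on the Pod spec. In this sketch I assume the secondary scheduler is named scheduler-plugins-scheduler; substitute whatever name you configured for yours.

```yaml
# Sketch: a Pod that opts into the secondary scheduler instead of the default one.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  schedulerName: scheduler-plugins-scheduler   # assumed name of the secondary scheduler
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sleep", "3600"]
```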
However, running multiple schedulers in one cluster comes with a caveat: they operate on the same pool of nodes and pods, which can lead to race conditions or conflicts. If two schedulers place pods into the last remaining capacity on a node at the same time, the kubelet admits only what fits; the losing pod fails and must be rescheduled (or recreated by its controller). In testing environments this conflict is rare, but in production a safer approach may be to replace the default scheduler entirely with a custom build that includes the plugins you need, so there is only one scheduler process.
Under the hood, these plugins integrate via the Scheduling Framework’s extension points (filter, score, bind, etc.). For example, Coscheduling hooks into the QueueSort and Permit phases to hold pods until their gang is ready, while Trimaran adds a Score plugin that uses live metrics to rank nodes. Because they’re compiled in, they execute as part of the scheduler’s normal in-process logic, which is efficient. The project is maintained in lockstep with Kubernetes releases to ensure compatibility. Essentially, SIG Scheduling has provided this toolbox so that users can achieve advanced scheduling behavior without maintaining a forked scheduler or extensive in-house development.
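To see what that wiring looks like, here is a sketch of a KubeSchedulerConfiguration profile that enables Coscheduling at its extension points. The API version, extension-point list, and argument names depend on your Kubernetes and scheduler-plugins versions, so treat this as illustrative rather than a copy-paste config.

```yaml
# Sketch: one scheduler profile with Coscheduling enabled across its extension points.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: scheduler-plugins-scheduler
    plugins:
      queueSort:
        enabled:
          - name: Coscheduling
        disabled:
          - name: "*"            # only one queue-sort plugin may be active
      preFilter:
        enabled:
          - name: Coscheduling
      permit:
        enabled:
          - name: Coscheduling   # holds pods here until the whole gang can run
      reserve:
        enabled:
          - name: Coscheduling
      postBind:
        enabled:
          - name: Coscheduling
    pluginConfig:
      - name: Coscheduling
        args:
          permitWaitingTimeSeconds: 60   # how long Permit waits for the rest of the gang
```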
References#
kubernetes-sigs/scheduler-plugins: Repository for out-of-tree scheduler plugins based on scheduler framework | GitHub ↩︎
J. Santos, C. Wang, T. Wauters and F. De Turck, “Diktyo: Network-Aware Scheduling in Container-Based Clouds,” in IEEE Transactions on Network and Service Management, vol. 20, no. 4, pp. 4461-4477, Dec. 2023, doi: 10.1109/TNSM.2023.3271415 ↩︎