A perspective on the current and future state of Kubernetes scheduling
Kubernetes scheduling is the brain of the cluster, deciding which node runs each Pod. In this post, I explore advanced scheduling mechanisms in Kubernetes. I start with how the default scheduler works under the hood, then dive into the scheduler-plugins project for extending its capabilities. I review custom schedulers and plugins in the ecosystem that focus on cost savings, SLA optimization, and performance tuning, noting which are community-supported or vendor-specific. Finally, I look at future trends in Kubernetes scheduling, from AI-driven algorithms to multi-cluster and energy-aware schedulers. This post brings together the key ideas from the full series I wrote on Kubernetes scheduling.
Kubernetes Scheduler#
The Kubernetes default scheduler1 (kube-scheduler) is a control-plane component that watches for unscheduled Pods and assigns each to an optimal Node. Scheduling a Pod involves two main phases: a scheduling cycle (selecting a suitable node) and a binding cycle (confirming the assignment).
At a high level, the scheduler’s workflow can be broken down into queueing, filtering, scoring, and binding:
Queueing: New Pods without a Node are added to a priority queue2. The scheduler pulls Pods from the ActiveQ queue to process. Since Kubernetes 1.263, Pods are added to the ActiveQ queue only if the PreEnqueue plugins determine that the Pod is ready for scheduling. Pods that are scheduling-gated are placed in the UnschedulableQ queue instead. Unschedulable Pods can later be moved to the podBackoffQ to retry scheduling in response to an event, for example the removal of their scheduling gates or a new node joining the cluster. Kubernetes 1.32 promoted a recent feature, QueueingHint, to beta: a callback function that decides whether a Pod might have become schedulable and, if so, requeues it promptly to the ActiveQ queue. This design balances fairness (no Pod waits forever) with honoring priority levels. By default, Pods are prioritized by their .spec.priority (higher priority first) and, within the same priority, by FIFO order, though this is configurable via a QueueSort plugin. This ensures important Pods get scheduled sooner while preventing starvation of lower-priority Pods.
Filtering (Predicates): In the filtering phase, the scheduler finds all feasible nodes for the Pod. It evaluates each candidate node against a set of predicate checks, the Filter plugins. Filters check node taints and Pod tolerations, affinity/anti-affinity rules, volume topology, and so on. These predicate functions are essentially boolean conditions: a node must pass all filters to be considered viable. If no node passes filtering, the Pod cannot be scheduled and remains Pending.
Scoring (Priorities): After filtering, multiple nodes are usually still eligible. The scoring phase ranks the feasible nodes to choose the most suitable one. The scheduler calculates a score for each node by running various scoring functions (now Score plugins). These scores (typically 0–100) assess how well a node satisfies soft preferences. For example, one scoring plugin favors spreading Pods across nodes for availability, while another (ImageLocality) gives a higher score to nodes that already have the Pod’s container image, reducing pull time. Each plugin contributes to a node’s score, and Kubernetes sums and normalizes them to pick the highest-scoring node4. If multiple nodes tie for the top score, the scheduler selects one at random to break the tie, which avoids bias and improves fairness.
Binding: Once the scheduler selects the best Node, it issues a Bind action, essentially writing the decision back to the API server by setting the Pod’s .spec.nodeName to that Node. At this point the Pod is considered scheduled, and the kubelet on the target node will eventually start the Pod’s containers. The binding step is a critical section: Kubernetes uses optimistic concurrency and retries to handle the case where the cluster state changed in the meantime (e.g., another Pod took the resources). In normal operation this all happens quickly and transparently.
Preemption (if needed): An additional step, preemption5, comes into play when no feasible node is found for a Pod but lower-priority Pods are occupying resources. Kubernetes will attempt to evict (preempt) some lower-priority Pods to free resources for the pending high-priority Pod. Preemption is only considered for Pods with a PriorityClass, and it respects certain policies: it won’t evict Pods with equal or higher priority, and it tries to minimize the number of victims. This mechanism helps honor quality of service and SLAs for critical workloads; for example, if a vital Pod can’t schedule because other workloads have saturated the resources, the scheduler may remove less important Pods to make room. Preemption is triggered in the PostFilter phase of the scheduling cycle (after failing to find a node), ensuring that the scheduler tries all normal scheduling means first.
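To make the priority and preemption machinery concrete, here is a minimal sketch (the class name, value, and image are illustrative, not taken from the post): a PriorityClass and a Pod that opts into it via priorityClassName.

```yaml
# A high PriorityClass; Pods using it may trigger preemption of lower-priority
# Pods when no feasible node is found (name and value are illustrative).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000                             # higher value = considered (and preempts) first
preemptionPolicy: PreemptLowerPriority     # the default; set "Never" to schedule first but never evict
globalDefault: false
description: "For latency-critical services."
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: critical-service     # resolved into .spec.priority at admission time
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "500m"
        memory: 256Mi
```

Setting preemptionPolicy to Never yields a class that is placed ahead of others in the queue but never evicts running Pods.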
The default scheduling algorithm must balance latency, throughput, and accuracy of placement in large clusters. Kubernetes scheduling is an NP-hard6 optimization problem, but the scheduler is designed to work efficiently even with thousands of nodes and Pods. For scalability, the scheduler doesn’t examine every node for every Pod: by default, it samples a percentage of nodes to evaluate when the cluster is very large. The percentageOfNodesToScore parameter caps the work per scheduling cycle, trading a bit of optimality for speed at scale. In practice, this means that in a 5,000-node cluster the scheduler might score only 10% of nodes (500 nodes) for each Pod, dramatically reducing scheduling latency with minimal impact on placement quality.
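As a sketch, the sampling threshold lives in the scheduler’s configuration file passed via --config; the value below is illustrative.

```yaml
# kube-scheduler is started with --config=<path-to-this-file>.
# In very large clusters, scoring only a sample of the feasible nodes keeps
# scheduling latency low at a small cost in placement optimality.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 10   # score roughly 10% of nodes; 0 means "use the built-in adaptive default"
```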
The scheduler’s throughput (Pods scheduled per second) becomes important at scale. Kubernetes can schedule dozens of Pods per second under typical conditions, but heavy custom plugins or extenders can slow it down. Ensuring fairness in scheduling is also critical in multi-tenant clusters. Kubernetes relies on the priority queue and optional Pod quotas to prevent any one workload from hogging all resources. For example, the scheduler by default sorts Pods by priority, so a burst of low-priority Pods won’t delay a high-priority Pod. At the same time, random tie-breaking and round-robin binding of equivalent Pods help distribute opportunities. Advanced policies like Capacity Scheduling can enforce fairness more explicitly across teams or queues. In summary, the default scheduler aims to be fast, reasonably fair, and “good enough” in placement for general workloads, while allowing extension for specialized needs.
```mermaid
%%{init: { "theme": "redux-dark", "flowchart": { "htmlLabels": true, "curve": "linear", "defaultRenderer": "elk" } } }%%
flowchart TB
    subgraph Scheduling_Cycle["Scheduling Cycle"]
        Q(["ActiveQ"])
        n1["PreEnqueue"]
        F["Filter: discard infeasible nodes"]
        X["PostFilter: Preemption?"]
        S["Score: rank feasible nodes"]
        H["Select highest score node"]
        R["Reserve node"]
        n2(["UnschedulableQ"])
        n3(["PodBackOffQ"])
        n5["QueueSort"]
    end
    subgraph Binding_Cycle["Binding Cycle"]
        B["Bind Pod to Node"]
    end
    n1 --> Q & n2
    Q --> n5
    n5 --> F
    F -- no nodes --> X
    X -- "evict low-priority pod" --> Q
    F --> S
    S --> H
    H --> R
    R --> B
    P["New Pod"] --> n1
    X -- no eviction --> n2
    n2 -- external events or QueueingHint --> n3
    n3 --> Q
```
Kubernetes Scheduling Framework#
The Kubernetes scheduler makes placement decisions one Pod at a time (per scheduling cycle). It does not globally optimize across all Pods in a single pass; instead, it makes a series of local decisions. This can occasionally lead to suboptimal placements (e.g., a Pod might take the “last slot” on a node that a later high-priority Pod could have used). Efforts like Pod preemption and the variety of scoring heuristics try to mitigate this. Some advanced schedulers take a more global or intelligent approach to placement.
In older Kubernetes versions, the scheduling logic was described in terms of Predicates (filtering functions) and Priorities (scoring functions). Administrators could even configure custom scheduling policies by enabling or disabling certain predicates/priorities. Modern Kubernetes has evolved this into the Scheduling Framework, where each stage (filter, score, bind, etc.) is an extension point for plugins. However, the core ideas remain the same. The default scheduler still uses a set of built-in filters (e.g. node resource checks, taint tolerations) and scoring rules (e.g. spread, resource balance), along with a deterministic workflow of phases.
Another concept is extenders – a pre-framework way to run external scheduling logic via webhook calls. Extenders allow an external service to filter or score nodes in addition to the default process7. While extenders are still supported, they have drawbacks (additional latency for HTTP calls, complexity, and only limited extension points). Kubernetes 1.19+ replaces most use cases for extenders with in-process plugins, which are more efficient.
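For reference, an extender is declared in the scheduler configuration roughly as follows; the service URL and verbs here are placeholders, and the full field list is defined by the Extender configuration API.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "http://scheduler-extender.kube-system.svc:8080"  # hypothetical extender service
  filterVerb: filter           # the scheduler POSTs candidate nodes to <urlPrefix>/filter
  prioritizeVerb: prioritize   # and asks <urlPrefix>/prioritize for additional node scores
  weight: 1
  enableHTTPS: false
  nodeCacheCapable: false
  ignorable: true              # scheduling continues even if the extender is unreachable
```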
More recently, the introduction of PreEnqueue plugins and the SchedulingGates feature allows for more complex scheduling logic. PreEnqueue plugins determine whether a Pod is ready to be scheduled, while SchedulingGates can block scheduling until certain conditions are met (e.g., waiting for a node to be ready). Scheduling gates can be added at Pod creation time, either manually or automatically through mutating webhooks, and the scheduler will not schedule a Pod until all of its gates are removed. External controllers are responsible for removing the gates. This enables more complex scheduling scenarios, such as waiting for a specific condition to be met before scheduling a Pod.
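A minimal sketch of a gated Pod follows; the gate name is a hypothetical example, and whichever controller owns that gate removes it to release the Pod for scheduling.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gated-workload
spec:
  schedulingGates:
  - name: example.com/quota-approved   # hypothetical gate; the Pod stays SchedulingGated until it is removed
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```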
Operators like Kueue and OpenShift’s Multiarch Tuning Operator benefit from this feature. Kueue is a Kubernetes-native job queueing system that lets users define queues and scheduling policies for HPC/AI workloads. The Multiarch Tuning Operator lets users define scheduling policies for multi-architecture clusters, ensuring that workloads are scheduled on nodes of the appropriate architecture.
The Kubernetes scheduler-plugins Project#
As Kubernetes deployments grew in scale and diversity, the need arose to customize scheduling for specific scenarios (batch jobs, real-time workloads, multi-tenant fairness, etc.). While the Scheduling Framework enables custom plugins, writing and maintaining those in production can be non-trivial. The Kubernetes SIG Scheduling created the kubernetes-sigs/scheduler-plugins project to provide a collection of out-of-tree scheduler plugins based on the official framework8.
The scheduler-plugins repository is an official sidecar project that houses a set of plugins developed and used by large companies and the community. These plugins implement advanced scheduling capabilities that go beyond the default scheduler’s heuristics. The idea is that cluster operators can take these plugins and easily build a custom scheduler (or integrate them into the existing scheduler) without starting from scratch. The project provides pre-built container images and even Helm charts to deploy a scheduler with these plugins enabled. In essence, it’s an extension library for Kubernetes scheduling.
Some of the notable plugins in this repository include:
Capacity Scheduling: Allows sharing the cluster capacity among multiple groups or tenants with fairness. It introduces the concept of ElasticQuota, so that each team (or queue) gets a guaranteed quota but can borrow unused resources from others – enabling higher overall utilization while maintaining fairness. This is inspired by Hadoop YARN’s Capacity Scheduler.
Coscheduling: Implements gang scheduling for Pods that are parts of a collective job. Coscheduling ensures that a set of Pods (belonging to a PodGroup) are scheduled together or not at all. This is important for MPI, Spark, or AI jobs that require multiple Pods to start simultaneously to be effective. If not all Pods in the group can schedule, the scheduler will hold off (or keep them pending) until the gang can be formed, thus meeting the job’s SLA for all-or-nothing placement.
Node Resources and Node Resource Topology: These plugins improve awareness of node hardware details. For example, Node Resource Topology enables NUMA-aware scheduling – ensuring that a Pod with high CPU needs lands on a single NUMA node for better performance (reducing cross-memory access). It can also consider things like GPU topology or specialized device placement.
Preemption Toleration: This plugin refines the preemption behavior. It can be used to mark certain Pods or namespaces as “tolerant” to preemption, meaning they won’t trigger preemption or won’t be preempted under certain conditions. This is useful to prevent churn in specific scenarios or to implement graceful degradation policies.
Trimaran (Load-Aware Scheduling): Trimaran9 is a set of plugins that make the scheduler load-aware, meaning decisions consider actual runtime metrics (CPU utilization, etc.) in addition to just requested resources. For instance, a node might have CPU requests free, but if the node’s current CPU usage is very high (due to other pods bursting), a load-aware scheduler would avoid placing a new Pod there. Trimaran plugins use a load-watcher service to fetch metrics from sources like Metrics Server or Prometheus. By packing Pods based on real utilization gaps, this can improve efficiency and avoid “noisy neighbor” issues. This was developed from research by IBM and others.
Network-Aware Scheduling: The network-aware scheduling plugin10 factors in network topology and traffic costs. It might, for example, prefer scheduling Pods in the same availability zone or network segment to minimize latency or cross-zone data transfer costs. For network-intensive workloads, this can significantly improve performance and reduce cloud egress costs11.
Integration and usage: To use these plugins, you have two main options. One is to run a custom scheduler binary that has the plugins compiled in (the project provides a Kubernetes scheduler binary with all these plugins). The other is to run it as a secondary scheduler alongside the default: for example, you might deploy a second scheduler in your cluster (with a different name) that handles only the Pods that set spec.schedulerName to your custom scheduler.
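As an illustration (the profile name is an assumption, and the exact plugin wiring should be taken from the scheduler-plugins documentation), a secondary scheduler gets its own profile name and Pods opt in via spec.schedulerName:

```yaml
# Configuration file for the secondary scheduler built from scheduler-plugins,
# mounted into its Deployment and passed via --config (profile name is an assumption).
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling        # plugin name as published by kubernetes-sigs/scheduler-plugins
---
# A Pod that should be handled by the secondary scheduler instead of the default kube-scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: gang-member
spec:
  schedulerName: scheduler-plugins-scheduler
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```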
However, running multiple schedulers in one cluster comes with a caveat: they operate on the same pool of nodes and pods, which can lead to race conditions or conflicts. If two schedulers try to schedule pods onto the last available slot of a node at the same time, one pod will fail and get rescheduled (the kubelet only accepts one). In testing environments this conflict is rare, but in production a safer approach may be to replace the default scheduler entirely with a custom-built one that includes the plugins you need (so there’s only one scheduler process).
Under the hood, these plugins integrate via the Scheduling Framework’s extension points (filter, score, bind, etc.). For example, coscheduling hooks into the QueueSort and Filter phases to hold pods until their gang is ready; Trimaran adds a Score plugin that uses live metrics to score nodes. Because they’re compiled in, they execute as part of the scheduler’s normal in-process logic, which is efficient. The project is maintained in lockstep with Kubernetes versions (ensuring compatibility). Essentially, SIG Scheduling has provided this toolbox so that users can achieve advanced scheduling behavior without forked custom schedulers or extensive in-house development.
Scheduler Plugins and Custom Schedulers in the Ecosystem#
The Kubernetes ecosystem has developed many custom schedulers and plugins to optimize specific use cases. Here we provide an overview of some existing solutions focused on cost savings, SLA (service-level agreement) optimization, and performance tuning.
Batch and ML Job Schedulers (Performance & SLA):
For AI/ML training, big data, and HPC workloads, scheduling requirements differ from typical microservices. These jobs are often long-running, resource-intensive, and may require scheduling multiple pods together. Several custom schedulers12 have arisen to handle this:
Volcano: An open-source batch scheduling system built on Kubernetes, now a CNCF project, aimed at high-performance workloads. Volcano introduces concepts like job queues, PodGroups (gang scheduling), priority-based job ordering, and fair-share scheduling among jobs. It’s essentially the successor to the earlier kube-batch project. Volcano improves execution of complex workloads (e.g. AI training) by ensuring all pods of a job start together and by enforcing fairness among jobs in different queues. It supports features like job dependencies, requeueing, and custom plugins for scheduling policies. Volcano is community-maintained and widely used in supercomputing and AI platforms to meet SLA requirements of batch jobs (e.g., not starting a training job until all its GPU pods can run, to avoid idle GPUs).
Apache YuniKorn: A universal scheduler originally from the Hadoop ecosystem (incubating at Apache) that can run on Kubernetes. YuniKorn13 is designed for multi-tenant big data clusters, providing fine-grained resource quotas via hierarchical queues and global fairness across frameworks (Spark, Flink, etc.). It essentially replaces YARN’s capacity scheduler with a K8s-native solution. YuniKorn focuses on efficient sharing and high throughput, making sure each tenant gets their fair share of resources and that cluster utilization stays high. For example, it can enforce that no single user exceeds certain usage, and it can pack batch jobs tightly. One real-world trial showed that an advanced orchestrator (combining tuning and scheduling) could achieve up to 77% cost savings or 45% performance acceleration for Spark and Airflow workflows on AWS14, illustrating the potential gains from smarter scheduling of batch jobs.
Koordinator: An open-source scheduler (and suite of controllers) originally open-sourced by Alibaba, aimed at co-locating diverse workloads (online services and batch jobs) on the same cluster for efficiency. Koordinator15 uses a QoS-based scheduling approach. It classifies Pods by priority and QoS level (e.g., prod vs. batch) and allows lower-priority batch jobs to use leftover capacity without hurting the performance of high-priority services. It does this through elastic resource quotas, oversubscription (overcommit), and runtime resource isolation (e.g., using Linux cgroups and throttling via a Koordinator component on each node). Essentially, Koordinator lets batch jobs run on nodes that have spare CPU cycles, but if the primary service on that node suddenly needs CPU, Koordinator’s controllers can throttle or evict the batch Pods to guarantee the service’s SLA. This kind of scheduling improves utilization (and thus cost) while maintaining performance SLAs for critical services. Koordinator also integrates with the scheduler framework and provides profiles for different scenarios (AI, Spark, etc.). It’s community-driven and used in production at large scale.
Poseidon/Firmament: This is a combination of a K8s integration - Poseidon16 - with an academic scheduler - Firmament17. Firmament, developed from research, uses a min-cost max-flow algorithm on a flow network graph to model scheduling globally. Instead of scheduling one Pod at a time, it periodically solves an optimization problem considering all unscheduled pods and cluster state to find an optimal assignment. Poseidon is the bridge that feeds Kubernetes cluster state into Firmament and then executes the decisions. The promise of this approach is more optimal placement and higher cluster efficiency than the greedy default scheduler. It can handle complex constraints and rapidly changing workloads by re-flowing the network. However, it’s also more computationally heavy, which is why it’s beneficial mainly in large clusters where incremental optimality pays off. Poseidon/Firmament is open-source (a SIG project) and represents a more academic, global optimization take on scheduling. It’s a great example of how ideas from research (like flow networks for scheduling) can be applied to Kubernetes to reduce scheduling latency and improve utilization for very dynamic environments. This project is community-supported, though not as widely adopted due to its complexity.
Cost-Saving Schedulers:
Running Kubernetes efficiently can translate to significant cost savings, especially in cloud environments. Aside from right-sizing nodes and autoscaling, scheduling can play a direct role in cutting costs:
Binpacking and Spot Instance Optimization: Some custom schedulers or platform solutions focus on bin-packing Pods to minimize the number of nodes in use. For instance, a scheduler that tightly packs Pods (while respecting their requests) can free up whole nodes, which can then be scaled down (in the cloud) to save money. Vendor tools like Cast AI18 provide automation that influences scheduling for cost optimization, essentially a combination of a custom scheduler and cluster autoscaler tuned for lowest cost. They will, for example, prefer scheduling new Pods on nodes that are underutilized and even move Pods onto cheaper Spot instances when available. Cast AI’s platform claims to automatically bin-pack and choose the most cost-efficient nodes, even balancing on-demand vs. spot to achieve ~50% or more savings. While Cast AI is a vendor SaaS, the underlying idea can be achieved with open components: the Kubernetes scheduler-plugins project includes the Trimaran plugins (such as TargetLoadPacking), which effectively do bin-packing by considering actual utilization gaps.
Reserved Instances and Affinity: Some enterprises run multiple workloads with different cost considerations (e.g., one cluster might have a mix of reserved and on-demand cloud instances). Custom scheduling rules can ensure that certain workloads run on prepaid or cheaper resources. For example, a company might tag nodes that are on a cheaper reservation or in a lower-cost region and use node affinity rules or a custom scheduler to prefer placing batch jobs there (see the affinity sketch after this list). This can be done with Kubernetes affinities and taints today (manually), but we’re seeing the emergence of controllers that automate this placement for cost awareness, effectively “economy-aware” scheduling.
Deschedulers for cost: While not a scheduler per se, the Kubernetes Descheduler19 is a tool that evicts pods which are poorly placed (e.g., if a node becomes overcommitted or if pods could be packed better elsewhere over time). By evicting and letting the scheduler reassign pods, the descheduler can indirectly improve bin-packing post factum. This helps with cost optimization by continuously nudging the cluster towards a more efficient state, especially after scale-ups where pods might be spread out.
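As referenced above, here is a minimal sketch of steering tolerant workloads toward cheaper capacity with a soft node affinity; the label and taint keys are hypothetical stand-ins for whatever your cloud or provisioning tooling actually exposes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: cost.example.com/capacity-type   # hypothetical label applied by your provisioning tooling
            operator: In
            values: ["spot"]
  tolerations:
  - key: cost.example.com/spot                    # hypothetical taint on spot nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: worker
    image: registry.k8s.io/pause:3.9
```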
SLA and Topology Optimizations:
SLA optimization often means ensuring critical workloads get the resources and environment they need for reliable performance and uptime:
Topology-aware and Affinity scheduling: Kubernetes has built-in features like Pod affinity/anti-affinity and topology spread constraints, but custom plugins can enhance these (a topology-spread sketch follows this list). The Network-Aware Scheduling plugin mentioned earlier is one; another could be a latency-aware scheduler20 that knows certain nodes are in the same rack or cloud zone and tries to place cooperating Pods accordingly to meet latency SLOs. Telco and edge use cases sometimes employ custom schedulers to ensure a Pod lands on a node with, say, a GPU and within a particular network latency to a base station. Projects like OpenYurt21 and KubeEdge22 extend scheduling to prefer edge nodes for certain Pods, effectively scheduling with geographic topology in mind for SLAs (this veers into multi-cluster territory as well).
Custom prioritization for SLAs: Some vendor-specific implementations allow scheduling based on business metrics or SLO policies. For example, a bank might have a custom scheduler that always schedules trading service pods on the most reliable nodes, using a custom label and plugin logic. These are typically in-house or vendor-provided customizations on top of the Kubernetes scheduler, ensuring certain Pods always win if there’s competition for resources (beyond just PriorityClasses). For community examples, the scheduler-plugins CapacityScheduling plugin can enforce per-tenant or per-service minimum guarantees – ensuring even if the cluster is busy, each important service can get its share (an SLA guarantee in terms of resource share).
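To ground the topology discussion above, a sketch using only built-in APIs: spread replicas across zones while softly co-locating them with a cooperating service (labels and app names are illustrative).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone     # spread replicas evenly across zones
    whenUnsatisfiable: ScheduleAnyway            # soft constraint; DoNotSchedule would make it hard
    labelSelector:
      matchLabels:
        app: frontend
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname    # prefer nodes already running the cache
          labelSelector:
            matchLabels:
              app: cache                         # hypothetical cooperating service
  containers:
  - name: frontend
    image: registry.k8s.io/pause:3.9
```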
Community vs Vendor Support:
Most of the projects mentioned (Volcano, YuniKorn, Koordinator, Poseidon/Firmament, Descheduler) are open-source and community-driven. They can typically be adopted on any Kubernetes cluster (on-prem or cloud). Meanwhile, vendors often incorporate custom scheduling logic into their managed services or offer proprietary tools:
Cloud provider enhancements: For example, GKE (Google Kubernetes Engine) has largely used the default scheduler, but Google’s internal Borg and Omega systems (which inspired K8s) use advanced techniques like machine learning for scheduling, and some of those learnings trickle down into K8s features (like the “balanced allocation” scoring that spreads resource usage). Karpenter23 is more of an advanced autoscaler, but it influences scheduling by provisioning optimal nodes on the fly for incoming Pods (making it part scheduler, part infrastructure tool). VMware’s Kubernetes-based platforms have features to schedule VMs for tenant isolation. These vendor solutions are often not exposed as configurable schedulers, but they achieve similar goals behind the scenes.
Third-party platforms: Apart from Cast AI, other companies like Turbonomic (IBM) have tools that analyze cluster performance and can adjust scheduling (via moving pods) for optimization. These act externally, by recommending or performing evictions and rescheduling to meet SLAs or reduce cost.
The good news is Kubernetes’s extensibility means that you’re not stuck with “one-size-fits-all.” If the default scheduler doesn’t meet your needs (be it packing efficiency, fairness, or specialized hardware handling), you can likely find an existing project or plugin that does, or write your own using the scheduling framework. The ecosystem is rich and growing, addressing niche requirements from running hyper-large batch jobs to ensuring real-time critical pods never miss a deadline.
Future Trends in Kubernetes Scheduling#
Looking ahead, several exciting trends and research areas are poised to shape the future of Kubernetes scheduling:
AI-Driven Scheduling: One of the hottest topics is applying machine learning or AI to scheduling decisions24. The idea is that instead of static heuristics (if node has X free, score Y), an AI model could learn from historical data to predict the best placement for a workload. For example, a neural network could be trained to predict the performance of a web service if placed on a given node (based on that node’s current load, hardware, etc.)25, and the scheduler could use that to choose a node that minimizes start-up latency or maximizes the app’s throughput. Researchers have already prototyped custom schedulers that use reinforcement learning (RL) or decision tree models to schedule pods. Early results26 show it’s possible to have an ML model make scheduling decisions that improve resource utilization and lower latency compared to the default scheduler. For instance, one approach trains a model to predict which node will result in the fastest pod initialization for a given application, effectively optimizing away cold-start delays.
AI-driven scheduling could also mean dynamic adaptation27: the scheduler might adjust its strategy based on current cluster conditions (learned patterns). A reinforcement learning scheduler could continuously fine-tune its node scoring function by observing reward signals (like pods running successfully, or overall cluster efficiency). This is still an emerging field – challenges include the complexity of model training, the need for simulation (you can’t trial-and-error on a production cluster easily), and ensuring the model’s decisions are safe and explainable. However, as clusters grow and scheduling scenarios get more complex (think mixed workloads, edge, and cloud), AI might help juggle those factors better than static algorithms. We might even see integration of AI at the periphery: for example, using predictive analytics to proactively scale out nodes and schedule pods before a known traffic spike (predictive autoscaling combined with scheduling). Some projects and papers like Octopus26 or others in academic conferences are exploring these ideas. In the next 5 years, it’s reasonable to expect at least optional AI guidance in Kubernetes scheduling – perhaps as a “hint” mechanism where an external service suggests placements to the scheduler based on learned data.
Multi-Cluster and Federated Scheduling: As enterprises move to hybrid cloud and multi-region deployments, the concept of scheduling across multiple clusters is gaining attention. Federated scheduling (a.k.a. scheduling in Kubernetes Federation or via systems like Karmada) means deciding not just which node but which cluster a pod should run in. Projects like KubeFed (Kubernetes Federation v2) and Karmada aim to make multiple clusters act in concert. Karmada28, for example, provides “multi-policy, multi-cluster scheduling” – you can define a policy for an application that says how to distribute its replicas across datacenters or clouds. The scheduler (Karmada’s scheduler component) will then place some replicas in cluster A, some in cluster B, according to rules (like spread 50/50 for high availability, or choose the cluster in a certain region for latency). It also supports failover: if one cluster goes down, it can reschedule those pods to another cluster.
We expect multi-cluster scheduling to become more prevalent, possibly with integration into the core Kubernetes API in the future. For example, a user might submit a workload to a global scheduler, and it decides the best cluster (based on capacity, cost, or policy) to run it. This involves solving a higher-level scheduling problem – not just nodes, but clusters as the targets. Federation v1 struggled with this, but newer systems are more promising. We might see standardized APIs for expressing multi-cluster affinity (“run 3 copies in US-West, 2 copies in EU-Central”) and global schedulers honoring them. Multi-cluster schedulers will also likely incorporate network awareness (so that an app and its database might get scheduled to the same cluster to minimize WAN traffic) and data locality (preferring the cluster where needed data is present). Multi-cluster scheduling is crucial for federated machine learning, disaster recovery (DR), and geo-distributed apps. Projects like Open Cluster Management (OCM) and others in CNCF are addressing parts of this puzzle too. Overall, expect Kubernetes to more seamlessly handle scheduling in a multi-cluster world.
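For a flavor of what such policies look like today, here is a rough Karmada PropagationPolicy sketch; the cluster names are placeholders and the exact fields should be checked against Karmada’s documentation.

```yaml
# Roughly how Karmada expresses "split this Deployment's replicas across two clusters".
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: web-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: web
  placement:
    clusterAffinity:
      clusterNames: ["cluster-us-west", "cluster-eu-central"]   # placeholder member clusters
    replicaScheduling:
      replicaSchedulingType: Divided            # divide replicas among clusters
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
        - targetCluster:
            clusterNames: ["cluster-us-west"]
          weight: 3
        - targetCluster:
            clusterNames: ["cluster-eu-central"]
          weight: 2
```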
Energy-Aware and Sustainable Scheduling: With growing focus on sustainability, there’s interest in making schedulers minimize energy consumption and carbon footprint. Kubernetes itself is being used in data centers with renewable energy and in edge environments where power is limited. An energy-aware scheduler might take into account the power efficiency of nodes or the current carbon intensity of the grid powering each node. For example, if one availability zone is currently drawing cleaner power (more renewable mix) than another, a carbon-aware scheduler could prefer scheduling new workloads in that zone to reduce emissions. Similarly, it might consolidate workloads on fewer nodes during off-peak hours so that other nodes can be turned off (saving energy), then spread them out during peak to meet performance needs. A recent FOSDEM talk discussed using the Kepler29 project (Kubernetes Efficient Power Level Exporter) to get per-container energy metrics and then scheduling based on those.
One concrete concept is carbon-aware scheduling: delay or advance certain jobs based on when electricity is greenest. For instance, a batch job like ML model training could wait until nighttime when wind power is abundant and then run, thereby using low-carbon energy30. If a cluster spans multiple regions, the scheduler could choose a region that currently has the lowest carbon intensity to run a job. This requires integrating external data (carbon intensity feeds, e.g., via APIs like WattTime or ElectricityMap). Some early experiments and tools are emerging for this (even outside K8s, e.g., Nomad has plugins for it). We expect Kubernetes to hop on – perhaps via a scheduling plugin that periodically tags nodes with an “eco-score” and then a custom score plugin that prefers higher scores (greener nodes) for certain workloads.
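Purely as a thought experiment on the “eco-score” idea: if an external controller labeled nodes with the current carbon intensity of their region, a deferrable job could prefer greener nodes using ordinary affinity rules. Everything below (label key and values) is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nightly-training-job
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: sustainability.example.com/carbon-intensity   # hypothetical label refreshed by a controller
            operator: In
            values: ["low"]
  containers:
  - name: trainer
    image: registry.k8s.io/pause:3.9
```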
Energy-aware scheduling also includes power capping strategies: if a data center is running hot, a scheduler might spread workloads to avoid high power draw in any one rack (preventing overheating). Or in edge, if a site is on battery, the scheduler might avoid placing heavy workloads there until power is stable. All of these require a tight loop between telemetry (like what Kepler provides) and decision-making. The Kubernetes community is very much aware of this trend – we can expect future KEPs (enhancement proposals) focusing on sustainability. In a broader sense, scheduling for sustainability might become as important as scheduling for performance is today.
Scheduler Extensibility and Profiles: On a meta level, we will likely see the scheduler become even more extensible. The Scheduling Framework was a big step (allowing out-of-tree plugins), and the scheduler can already run multiple scheduling profiles in one binary. This could be expanded so that, say, batch Pods automatically use a profile with different plugins (perhaps via a scheduler annotation or class). The community might also work on making scheduler customization more dynamic, loading plugins at runtime or via configuration rather than requiring a custom binary. This would let cluster admins and researchers experiment with new scheduling behaviors on the fly.
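Multiple profiles already look roughly like this: one scheduler binary serves several scheduler names with different plugin configurations, and a Pod picks one via spec.schedulerName. The second profile below is illustrative; it re-weights the built-in NodeResourcesFit scoring toward bin-packing.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler        # unchanged defaults for ordinary workloads
- schedulerName: bin-packing-scheduler    # illustrative profile that favors packing over spreading
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated               # pack nodes instead of balancing usage across them
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```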
Chaos and Resilience: As scheduling becomes more complex, there’s also interest in making the scheduler itself more resilient. We might see work on scheduler high availability (today only one scheduler instance leads at a time) or even decentralized scheduling (multiple schedulers coordinating, which was the idea behind Omega). There’s research on eliminating the single bottleneck by having many schedulers work in parallel on disjoint sets of Pods and then reconciling; this could come back in some form, especially for huge clusters or federated scenarios and with the rise of Kubernetes as the de facto standard platform for AI workloads.
The future of Kubernetes scheduling is headed toward smarter, more global, and more principled decisions: using AI to learn from the past, coordinating across clusters, and meeting objectives like cost and energy efficiency. It’s an active area of development in both industry and academia. Kubernetes, evolving from the lessons of Borg and Omega, continues to incorporate these advancements. For Kubernetes engineers and researchers, it’s an exciting domain – the scheduler is becoming a pluggable platform for innovation. We recommend keeping an eye on KEPs in SIG Scheduling and upcoming papers from KubeCon + research conferences to stay ahead in this space. The quest for the “optimal” scheduler is ongoing, and Kubernetes is at the forefront of bringing these theoretical ideas into practical, running code.
Table of Contents from the Kubernetes Scheduling Series#
- Under the hood
- The scheduling framework
- The scheduler-plugins project
- Community schedulers in the ecosystem
- Future trends
References#
Intel Granulate Technology - A Deep Dive into Kubernetes Scheduling ↩︎
The Burgeoning Kubernetes Scheduling System – Part 1: Scheduling Framework - Alibaba Cloud Community ↩︎
kubernetes-sigs/scheduler-plugins: Repository for out-of-tree scheduler plugins based on scheduler framework | GitHub ↩︎
J. Santos, C. Wang, T. Wauters and F. De Turck, “Diktyo: Network-Aware Scheduling in Container-Based Clouds,” in IEEE Transactions on Network and Service Management, vol. 20, no. 4, pp. 4461-4477, Dec. 2023, doi: 10.1109/TNSM.2023.3271415 ↩︎
11 Kubernetes Custom Schedulers You Should Use | overcast blog ↩︎
Cost savings and Performance Acceleration in AWS with Apache YuniKorn ↩︎
Poseidon-Firmament Scheduler – Flow Network Graph Based Scheduler ↩︎
Gog, Ionel, et al. “Firmament: Fast, centralized cluster scheduling at scale.” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016. ↩︎
Di Stefano, Alessandro, Antonella Di Stefano, and Giovanni Morana. “Improving QoS through network isolation in PaaS.” Future Generation Computer Systems 131 (2022): 91-105 ↩︎
Zeineb Rejiba and Javad Chamanara. 2022. Custom Scheduling in Kubernetes: A Survey on Common Problems and Solution Approaches. ACM Comput. Surv. 55, 7, Article 151 (July 2023), 37 pages. https://doi.org/10.1145/3544788 ↩︎
Dakić V, Đambić G, Slovinac J, Redžepagić J. Optimizing Kubernetes Scheduling for Web Applications Using Machine Learning. Electronics. 2025; 14(5):863. https://doi.org/10.3390/electronics14050863 ↩︎
Mahapatra, Rohan, et al. “Exploring efficient ml-based scheduler for microservices in heterogenous clusters.” Machine Learning for Computer Architecture and Systems 2022. 2022 ↩︎ ↩︎
J. Jeon, S. Park, B. Jeong and Y. -S. Jeong, “Efficient Container Scheduling With Hybrid Deep Learning Model for Improved Service Reliability in Cloud Computing,” in IEEE Access, vol. 12, pp. 65166-65177, 2024, doi: 10.1109/ACCESS.2024.3396652 ↩︎
FOSDEM 2023 - Carbon Intensity Aware Scheduling in Kubernetes ↩︎
Carbon Aware Scheduling on Nomad and Kubernetes - Green Web Foundation ↩︎