Kubernetes Scheduling: community schedulers in the ecosystem
The Kubernetes scheduler is the brain of the cluster, deciding which node runs each Pod. This is the fourth post in a series where I explore advanced scheduling mechanisms in Kubernetes. In this one, I give a broad overview of community-driven and vendor-supported custom schedulers built on top of Kubernetes. I focus on how these schedulers and plugins target specific goals like cost savings, SLA optimization, and performance tuning. I cover batch and ML-focused schedulers like Volcano, YuniKorn, and Koordinator, as well as research-driven systems like Poseidon/Firmament. I also look at cost-optimization strategies using bin-packing, spot instances, and descheduling, along with SLA-driven and topology-aware scheduling techniques. Finally, I reflect on the balance between community projects and vendor platforms, and how Kubernetes’s extensibility allows users to tailor scheduling to their workload and infrastructure needs.
Table of Contents#
- Under the hood
- The scheduling framework
- The scheduler-plugins project
- (You’re here) Community schedulers in the ecosystem
- Future trends
```mermaid
%%{ init: { 'logLevel': 'debug', 'theme': 'default',
    'themeVariables': { 'git0': '#ff0000', 'git1': '#00ff00' },
    'gitGraph': { 'showBranches': true, 'showCommitLabel': true } } }%%
gitGraph
    checkout main
    commit
    branch scheduling_series
    checkout scheduling_series
    commit
    checkout main
    commit
    merge scheduling_series tag: "under the hood"
    commit
    checkout scheduling_series
    commit
    commit
    checkout main
    commit
    merge scheduling_series tag: "scheduling framework"
    checkout scheduling_series
    commit
    commit
    checkout main
    merge scheduling_series tag: "scheduler-plugins"
    commit
    commit
    checkout main
    merge scheduling_series tag: "community schedulers" type: HIGHLIGHT
    checkout scheduling_series
    commit
    commit
    checkout main
    merge scheduling_series tag: "future trends"
    commit
```
Scheduler Plugins and Custom Schedulers in the Ecosystem#
The Kubernetes ecosystem has developed many custom schedulers and plugins to optimize specific use cases. Here we provide an overview of some existing solutions focused on cost savings, SLA (service-level agreement) optimization, and performance tuning.
Batch and ML Job Schedulers (Performance & SLA):
For AI/ML training, big data, and HPC workloads, scheduling requirements differ from typical microservices. These jobs are often long-running, resource-intensive, and may require scheduling multiple pods together. Several custom schedulers1 have arisen to handle this:
Volcano: An open-source batch scheduling system built on Kubernetes, now a CNCF project, aimed at high-performance workloads. Volcano introduces concepts like job queues, PodGroups (gang scheduling), priority-based job ordering, and fair-share scheduling among jobs. It’s essentially the successor to the earlier kube-batch project. Volcano improves execution of complex workloads (e.g. AI training) by ensuring all pods of a job start together and by enforcing fairness among jobs in different queues. It supports features like job dependencies, requeueing, and custom plugins for scheduling policies. Volcano is community-maintained and widely used in supercomputing and AI platforms to meet SLA requirements of batch jobs (e.g., not starting a training job until all its GPU pods can run, to avoid idle GPUs).
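For a sense of what gang scheduling looks like in practice, here is a minimal sketch of a Volcano Job: minAvailable expresses the gang constraint and queue places the job under a Volcano queue. The job name, queue name, image, and sizes are all illustrative.

```yaml
# A minimal Volcano Job sketch: all 4 workers are scheduled together or not at all.
# Names, queue, image, and resource sizes are illustrative.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training
spec:
  schedulerName: volcano   # hand the pods to the Volcano scheduler
  minAvailable: 4          # gang constraint: do not start until 4 pods can be placed
  queue: ml-queue          # queue used for fair-share and priority ordering
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example/tf-trainer:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```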
Apache YuniKorn: A universal scheduler originally from the Hadoop ecosystem (incubating at Apache) that can run on Kubernetes. YuniKorn2 is designed for multi-tenant big data clusters, providing fine-grained resource quotas via hierarchical queues and global fairness across frameworks (Spark, Flink, etc.). It essentially replaces YARN’s capacity scheduler with a K8s-native solution. YuniKorn focuses on efficient sharing and high throughput, making sure each tenant gets their fair share of resources and that cluster utilization stays high. For example, it can enforce that no single user exceeds a given usage quota, and it can pack batch jobs tightly. One real-world trial showed that an advanced orchestrator combining tuning and scheduling could achieve up to 77% cost savings or 45% performance acceleration for Spark and Airflow workflows on AWS3, illustrating the potential gains from smarter scheduling of batch jobs.
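To illustrate the queueing model, below is a hedged sketch of a YuniKorn queue configuration with hierarchical queues and per-queue guaranteed/max resources. Queue names and quotas are made up, and the exact value syntax depends on the YuniKorn release.

```yaml
# Sketch of a YuniKorn queue hierarchy (queues.yaml); names and quotas are illustrative.
partitions:
  - name: default
    queues:
      - name: root
        submitacl: "*"
        queues:
          - name: analytics        # e.g. Spark / Flink jobs
            resources:
              guaranteed:
                memory: 200G
                vcore: 100
              max:
                memory: 400G
                vcore: 200
          - name: ml               # e.g. training jobs
            resources:
              max:
                memory: 800G
                vcore: 400
```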
Koordinator: An open-source scheduler (and suite of controllers) originating at Alibaba, aimed at co-locating diverse workloads (online services and batch jobs) on the same cluster for efficiency. Koordinator4 uses a QoS-based scheduling approach. It classifies pods by priority and QoS level (e.g., prod vs. batch), and allows lower-priority batch jobs to use leftover capacity without hurting the performance of high-priority services. It does this through elastic resource quotas, oversubscription (overcommit), and runtime resource isolation (e.g., using Linux cgroups and throttling via a Koordinator component on each node). Essentially, Koordinator will let batch jobs run on nodes that have spare CPU cycles, but if the primary service on that node suddenly needs CPU, Koordinator’s controllers can throttle or evict the batch pods to guarantee the service’s SLA. This kind of scheduling improves utilization (and thus cost) while maintaining performance SLAs for critical services. Koordinator also integrates with the scheduler framework and provides profiles for different scenarios (AI, Spark, etc.). It’s community-driven and used in production at large scale.
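As a rough illustration of the QoS model, the sketch below marks a pod as best-effort batch under Koordinator so it runs on reclaimed capacity that can be throttled when co-located services need it. The label, PriorityClass, and batch resource names follow Koordinator’s conventions; the scheduler name, image, and amounts are illustrative.

```yaml
# A best-effort batch pod under Koordinator (illustrative values).
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor
  labels:
    koordinator.sh/qosClass: BE        # best-effort QoS tier
spec:
  schedulerName: koord-scheduler       # if Koordinator's scheduler runs as a second scheduler
  priorityClassName: koord-batch       # Koordinator's batch priority band
  containers:
    - name: executor
      image: example/spark-executor:latest
      resources:
        requests:
          kubernetes.io/batch-cpu: "2000"     # reclaimed (overcommitted) CPU, in milli-cores
          kubernetes.io/batch-memory: 4Gi
        limits:
          kubernetes.io/batch-cpu: "2000"
          kubernetes.io/batch-memory: 4Gi
```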
Poseidon/Firmament: This is a combination of a K8s integration - Poseidon5 - with an academic scheduler - Firmament6. Firmament, developed from research, uses a min-cost max-flow algorithm on a flow network graph to model scheduling globally. Instead of scheduling one Pod at a time, it periodically solves an optimization problem considering all unscheduled pods and cluster state to find an optimal assignment. Poseidon is the bridge that feeds Kubernetes cluster state into Firmament and then executes the decisions. The promise of this approach is more optimal placement and higher cluster efficiency than the greedy default scheduler. It can handle complex constraints and rapidly changing workloads by re-flowing the network. However, it’s also more computationally heavy, which is why it’s beneficial mainly in large clusters where incremental optimality pays off. Poseidon/Firmament is open-source (a SIG project) and represents a more academic, global optimization take on scheduling. It’s a great example of how ideas from research (like flow networks for scheduling) can be applied to Kubernetes to reduce scheduling latency and improve utilization for very dynamic environments. This project is community-supported, though not as widely adopted due to its complexity.
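From a user’s point of view, opting into an out-of-tree scheduler like Poseidon is just a matter of pointing the Pod at it via spec.schedulerName. The name below assumes the Poseidon deployment registers itself as poseidon; it depends on how the scheduler is deployed.

```yaml
# A pod that asks to be placed by the Poseidon/Firmament scheduler instead of the default one.
apiVersion: v1
kind: Pod
metadata:
  name: flow-scheduled-pod
spec:
  schedulerName: poseidon   # must match the scheduler name the Poseidon deployment uses
  containers:
    - name: app
      image: nginx:1.25
```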
Cost-Saving Schedulers:
Running Kubernetes efficiently can translate to significant cost savings, especially in cloud environments. Aside from right-sizing nodes and autoscaling, scheduling can play a direct role in cutting costs:
Bin-packing and Spot Instance Optimization: Some custom schedulers or platform solutions focus on bin-packing pods to minimize the number of nodes in use. For instance, a scheduler that tightly packs pods (while respecting their requests) can free up whole nodes, which can then be scaled down (in the cloud) to save money. Vendor tools like Cast AI7 provide automation that influences scheduling for cost optimization – essentially a combination of a custom scheduler and cluster-autoscaler tuned for lowest cost. They will, for example, prefer scheduling new pods on nodes that are underutilized and even replace pods onto cheaper Spot instances when available. Cast AI’s platform claims to automatically bin-pack and choose the most cost-efficient nodes, even balancing on-demand vs. spot to achieve ~50% or more savings. While Cast AI is a vendor SaaS, the underlying idea can be achieved with open components: the Kubernetes scheduler-plugins project includes the Trimaran plugins (such as TargetLoadPacking and LoadVariationRiskBalancing), which effectively bin-pack by scoring nodes on their actual utilization rather than just requested resources.
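As a sketch of the open-source route, the profile below enables the Trimaran TargetLoadPacking score plugin in a scheduler built from scheduler-plugins, packing nodes up to a target utilization before spilling onto new ones. The profile name, target, and load-watcher address are illustrative, and field names may vary slightly between releases.

```yaml
# Scheduler profile favoring bin-packing via actual node utilization (illustrative values).
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: cost-aware-scheduler
    plugins:
      score:
        enabled:
          - name: TargetLoadPacking
        disabled:
          - name: NodeResourcesBalancedAllocation   # we want packing, not spreading
    pluginConfig:
      - name: TargetLoadPacking
        args:
          targetUtilization: 70                     # pack nodes up to ~70% CPU
          watcherAddress: http://load-watcher.monitoring.svc:2020
```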
Reserved Instances and Affinity: Some enterprises run multiple workloads with different cost considerations (e.g., one cluster might have a mix of reserved and on-demand cloud instances). Custom scheduling rules can ensure that certain workloads run on prepaid or cheaper resources. For example, a company might tag nodes that are on a cheaper reservation or in a lower-cost region and use node affinity rules or a custom scheduler to prefer placing batch jobs there. This can be done with Kubernetes affinities and taints today (manually), but we’re seeing the emergence of controllers that automate this placement for cost awareness – effectively “economy-aware” scheduling.
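A minimal sketch of that idea with stock Kubernetes: nodes carry an operator-defined pricing label (pricing-tier here is hypothetical, not a standard label), and batch pods prefer the cheaper ones and tolerate a spot taint.

```yaml
# Batch pod that prefers cheaper capacity; label and taint keys are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: nightly-report
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: pricing-tier
                operator: In
                values: ["spot", "reserved"]
  tolerations:
    - key: pricing-tier/spot
      operator: Exists
      effect: NoSchedule
  containers:
    - name: report
      image: example/report:latest
```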
Deschedulers for cost: While not a scheduler per se, the Kubernetes Descheduler8 is a tool that evicts pods which are poorly placed (e.g., if a node becomes overcommitted or if pods could be packed better elsewhere over time). By evicting and letting the scheduler reassign pods, the descheduler can indirectly improve bin-packing post factum. This helps with cost optimization by continuously nudging the cluster towards a more efficient state, especially after scale-ups where pods might be spread out.
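For example, a Descheduler policy along these lines (a hedged sketch; the thresholds are illustrative and the v1alpha2 API shown here may differ in older releases) drains under-utilized nodes so the scheduler can re-pack their pods onto fewer nodes:

```yaml
# Descheduler policy: evict pods from under-utilized nodes so they can be packed elsewhere.
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: compact
    pluginConfig:
      - name: "HighNodeUtilization"
        args:
          thresholds:          # nodes below ALL of these are considered under-utilized
            cpu: 20            # percent of allocatable CPU
            memory: 20
            pods: 20
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
```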
SLA and Topology Optimizations:
SLA optimization often means ensuring critical workloads get the resources and environment they need for reliable performance and uptime:
Topology-aware and Affinity scheduling: Kubernetes has built-in features like Pod affinity/anti-affinity and topology spread constraints, but custom plugins can enhance these. The Network-Aware Scheduling plugin we mentioned is one; another could be a latency-aware scheduler9 that knows certain nodes are in the same rack or same cloud zone and tries to place cooperating pods accordingly to meet latency SLOs. Telco and edge use-cases sometimes employ custom schedulers to ensure a Pod is scheduled on a node with, say, GPU and within a particular network latency to a base station. Projects like OpenYurt10 and KubeEdge11 extend scheduling to prefer edge nodes for certain pods, effectively scheduling with geographic topology in mind for SLA (this veers into multi-cluster territory as well).
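The built-in features already get you quite far. The sketch below spreads replicas across zones while preferring to land in the same zone as the pods they talk to (label values and images are illustrative):

```yaml
# Built-in topology-aware placement: spread across zones, stay near the api-server pods.
apiVersion: v1
kind: Pod
metadata:
  name: api-cache
  labels:
    app: api-cache
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: api-cache
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                app: api-server
  containers:
    - name: cache
      image: example/cache:latest
```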
Custom prioritization for SLAs: Some vendor-specific implementations allow scheduling based on business metrics or SLO policies. For example, a bank might have a custom scheduler that always schedules trading service pods on the most reliable nodes, using a custom label and plugin logic. These are typically in-house or vendor-provided customizations on top of the Kubernetes scheduler, ensuring certain Pods always win if there’s competition for resources (beyond just PriorityClasses). For community examples, the scheduler-plugins CapacityScheduling plugin can enforce per-tenant or per-service minimum guarantees – ensuring even if the cluster is busy, each important service can get its share (an SLA guarantee in terms of resource share).
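CapacityScheduling works with an ElasticQuota object per tenant; a hedged sketch is below. The namespace and numbers are illustrative, and the API group has changed between scheduler-plugins releases.

```yaml
# Guarantee the trading namespace a minimum share while allowing bursting.
apiVersion: scheduling.x-k8s.io/v1alpha1   # group/version depends on the scheduler-plugins release
kind: ElasticQuota
metadata:
  name: trading
  namespace: trading
spec:
  min:            # guaranteed share; the plugin can preempt borrowers to reclaim it
    cpu: 40
    memory: 160Gi
  max:            # ceiling when the cluster has spare capacity
    cpu: 80
    memory: 320Gi
```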
Community vs Vendor Support:
Most of the projects mentioned (Volcano, YuniKorn, Koordinator, Poseidon/Firmament, Descheduler) are open-source and community-driven. They can typically be adopted on any Kubernetes cluster (on-prem or cloud). Meanwhile, vendors often incorporate custom scheduling logic into their managed services or offer proprietary tools:
Cloud provider enhancements: For example, GKE (Google Kubernetes Engine) has largely used the default scheduler, but Google’s internal Borg and Omega systems (which inspired K8s) use advanced techniques like machine learning for scheduling – some of those learnings trickle down into K8s features (like the introduction of “balanced allocation” scoring to spread resource usage). Karpenter12 is more of an advanced autoscaler, but it influences scheduling by provisioning optimal nodes on the fly for incoming pods, effectively acting as a scheduler plus infrastructure tool. VMware’s Kubernetes-based platforms have features to schedule VMs for tenant isolation. These vendor solutions are often not exposed as configurable schedulers, but they achieve similar goals behind the scenes.
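To make the Karpenter point concrete, here is a hedged sketch of a NodePool (Karpenter v1beta1 API; values are illustrative) that lets pending pods trigger the cheapest viable capacity, preferring Spot, and consolidates nodes when they become under-utilized:

```yaml
# Karpenter NodePool: provision Spot where possible, consolidate under-utilized nodes.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        name: default          # refers to a cloud-specific NodeClass, e.g. an EC2NodeClass
  disruption:
    consolidationPolicy: WhenUnderutilized
```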
Third-party platforms: Apart from Cast AI, other companies like Turbonomic (IBM) have tools that analyze cluster performance and can adjust scheduling (via moving pods) for optimization. These act externally, by recommending or performing evictions and rescheduling to meet SLAs or reduce cost.
The good news is Kubernetes’s extensibility means that you’re not stuck with “one-size-fits-all.” If the default scheduler doesn’t meet your needs (be it packing efficiency, fairness, or specialized hardware handling), you can likely find an existing project or plugin that does, or write your own using the scheduling framework. The ecosystem is rich and growing, addressing niche requirements from running hyper-large batch jobs to ensuring real-time critical pods never miss a deadline.
References#
11 Kubernetes Custom Schedulers You Should Use | overcast blog ↩︎
Cost savings and Performance Acceleration in AWS with Apache YuniKorn ↩︎
Poseidon-Firmament Scheduler – Flow Network Graph Based Scheduler ↩︎
Gog, Ionel, et al. “Firmament: Fast, centralized cluster scheduling at scale.” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016. ↩︎
Di Stefano, Alessandro, Antonella Di Stefano, and Giovanni Morana. “Improving QoS through network isolation in PaaS.” Future Generation Computer Systems 131 (2022): 91-105. ↩︎