KubeCon + CloudNativeCon Europe 2025 just concluded in London, bringing together thousands of cloud-native engineers, maintainers, and enthusiasts. As a local Distributed Systems Engineer involved in the Kubernetes community and ecosystem, attending in person was both energizing and insightful.

A central theme emerged across sessions: Kubernetes is rapidly evolving beyond microservices, adapting to support batch workloads, AI/ML training, HPC scenarios, and global-scale multi-cluster deployments. This shift isn’t just technical - it’s reshaping the cloud-native landscape and redefining how we think about workload orchestration, scheduling, and autoscaling.

Kubernetes may not have been built for these new frontiers - but it’s catching up fast.

Kubernetes wasn’t built for batch jobs - but it’s getting there#

The Kubernetes ecosystem is embracing the needs of AI/ML by evolving its scheduling and resource management. Many companies and AI/ML engineers prefer the Kubernetes ecosystem and its APIs, but Kubernetes alone lacks some of the advanced scheduling and tight job coordination features typical of HPC scenarios. Projects like Volcano are adding features specifically for AI, and core SIG Scheduling is considering how to incorporate batch concepts in a Kubernetes-native way. Gang scheduling is back at the center of distributed systems design today. The future of Kubernetes scheduling looks “pluggable” and workload-aware: standard Kubernetes for general workloads, augmented by specialized schedulers or plugins for AI, big data, and more, all coexisting, competing and cooperating on the same cluster.

It’s no secret that Kubernetes was initially engineered for long-running services, not batch jobs. The platform’s core design favors “microservice world” (always-on deployments) over “batch world” (finite-run jobs)1. This philosophical bias has historically made running HPC or AI workflows on Kubernetes harder than it should be. At KubeCon 2025, a clear message emerged: this gap is closing. A host of talks and projects demonstrated how the community is reinventing Kubernetes scheduling and orchestration to handle batch and AI workloads effectively, without abandoning Kubernetes’ declarative and cloud-native principles.

One standout session, “A Comparative Analysis of Kueue, Volcano, and YuniKorn”2 by Wei Huang (Apple) and Shiming Zhang (DaoCloud), tackled exactly this challenge. Batch workloads - from big data analytics to machine learning model training - share demanding requirements: they need to queue and batch-schedule jobs, isolate resources between tenants, maximize utilization, and still meet job deadlines or SLAs2. Wei and Shiming compared three open-source solutions born to address these needs: Kueue3, Volcano4, and Apache YuniKorn5. Each takes a different approach to batch scheduling on Kubernetes, and the talk offered a comprehensive, quantitative comparison of the trade-offs.
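To make the Kueue model concrete, here is a minimal sketch of how a batch Job is handed to a queue rather than scheduled immediately. It assumes Kueue is installed, that a LocalQueue named `team-a-queue` already exists in an `ml-experiments` namespace (both names are illustrative), and it uses the official Kubernetes Python client; the key ideas are the `kueue.x-k8s.io/queue-name` label and the suspended Job that Kueue resumes once quota is available.

```python
# Minimal sketch: submit a batch Job to a Kueue LocalQueue.
# Assumes Kueue is installed and a LocalQueue named "team-a-queue"
# exists in the "ml-experiments" namespace (illustrative names).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="train-demo",
        # Kueue admits the Job once quota is available in this queue.
        labels={"kueue.x-k8s.io/queue-name": "team-a-queue"},
    ),
    spec=client.V1JobSpec(
        # Kueue works with suspended Jobs and flips this flag on admission.
        suspend=True,
        parallelism=4,
        completions=4,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="python:3.12-slim",
                        command=["python", "-c", "print('training step')"],
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "2", "memory": "4Gi"}
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-experiments", body=job)
```

The same pattern - a queue label on a suspended workload - extends to the other Job-like APIs Kueue integrates with, which is what makes it feel Kubernetes-native rather than like a separate scheduler bolted on the side.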

Running batch jobs efficiently isn’t just a software scheduling problem - it’s also about leveraging existing HPC ecosystems. One of the more eye-opening talks was “Slinky: Slurm in Kubernetes”6 by Tim Wickberg, CTO of SchedMD - the company behind Slurm. For those not familiar, Slurm7 is a venerable open-source HPC workload manager used in supercomputers and clusters worldwide. It excels at scheduling Message Passing Interface (MPI) jobs, managing job queues, and maximizing utilization on tightly-coupled compute clusters. The question SchedMD posed was: can we combine Slurm’s HPC prowess with Kubernetes’ flexibility and user-friendly interfaces?

Slurm provides features to orchestrate batch jobs (e.g. precise CPU pinning, coordinated launch for MPI across nodes), but it lacks the cloud-native UX and elastic scalability that Kubernetes provides. SchedMD is developing Slinky as a toolbox of components to integrate Slurm with Kubernetes8. In essence, Slinky lets you deploy a Slurm control plane inside Kubernetes and coordinates scheduling between the two systems, bridging HPC and cloud native by running Slurm on Kubernetes and exposing Slurm’s capabilities through Kubernetes. Tim shared that by combining Slurm’s robust scheduling with Kubernetes, Slinky delivers “HPC-level performance and scheduling within an accessible cloud native platform”. AI researchers and batch users get the best of both worlds - they can submit jobs that benefit from Slurm’s gang scheduling9 and queue management, while still using Kubernetes constructs and elasticity.

On the same theme, “Cloud Native AI: Harness the Power of Advanced Scheduling for High-Performance AI/ML Training” by William Wang (NVIDIA) and Xuzheng Chang (Huawei) highlighted the unique scheduling challenges posed by AI/ML training jobs: multi-GPU, multi-node training, gang scheduling, GPU topology and affinity, heterogeneous resources, and co-scheduling.
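Gang scheduling is the most concrete of these challenges, so here is a minimal sketch of what it looks like with Volcano: a training job whose pods are admitted all-or-nothing via `minAvailable`. It assumes Volcano is installed; the namespace, image, command, and queue name are illustrative, and exact fields may vary across Volcano releases.

```python
# Minimal sketch of gang scheduling with Volcano: a training job whose
# 1 + 4 pods are admitted all-or-nothing via minAvailable.
# Assumes Volcano is installed; names, image and command are illustrative.
from kubernetes import client, config

config.load_kube_config()

pod_template = {
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "pytorch/pytorch:latest",
            "command": ["python", "train.py"],
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        }],
    }
}

vcjob = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "dist-train", "namespace": "ml-experiments"},
    "spec": {
        "schedulerName": "volcano",  # hand the pods to Volcano, not the default scheduler
        "minAvailable": 5,           # gang: no pod starts until all 5 can be placed
        "queue": "default",
        "tasks": [
            {"name": "master", "replicas": 1, "template": pod_template},
            {"name": "worker", "replicas": 4, "template": pod_template},
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="ml-experiments",
    plural="jobs",
    body=vcjob,
)
```

Under the hood Volcano backs this with a PodGroup object that its scheduler admits as a unit, which is exactly the all-or-nothing semantics distributed training needs to avoid deadlocked, half-started jobs.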

The rationale is compelling: Kubernetes’ APIs and ecosystem are what many companies and AI/ML engineers want to build on, yet Kubernetes alone lacks some of the advanced scheduling and tight job coordination features HPC and AI/ML users need10. Interestingly, KWOK11 was mentioned as a tool to simulate massive clusters of fake nodes and safely prototype scheduling strategies without needing a large physical cluster for testing. This highlights how something like KWOK is enabling faster innovation in the scheduling domain itself. A challenge that industry and academia have faced for the past 10 years - reproducibly simulating and comparing scheduling algorithms at scale - could be addressed by contributing to KWOK and building a research community around it, making it a standard for this kind of evaluation.
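As a flavor of that kind of experimentation, the sketch below registers a batch of KWOK-managed fake nodes so a scheduler can be exercised against a hundred “machines” that never exist. It assumes a cluster with the KWOK controller installed; the annotation and taint follow KWOK’s documented convention, while the node count, names, and labels are illustrative.

```python
# Minimal sketch: register KWOK-managed fake nodes so scheduling strategies
# can be exercised without real machines. Assumes the KWOK controller is
# installed in the cluster; node count and labels are illustrative.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for i in range(100):
    node = client.V1Node(
        metadata=client.V1ObjectMeta(
            name=f"kwok-node-{i}",
            annotations={"kwok.x-k8s.io/node": "fake"},  # picked up by KWOK
            labels={"type": "kwok", "kubernetes.io/role": "agent"},
        ),
        spec=client.V1NodeSpec(
            # Keep real workloads off the fake nodes; test pods tolerate this taint.
            taints=[client.V1Taint(
                key="kwok.x-k8s.io/node", value="fake", effect="NoSchedule"
            )]
        ),
    )
    v1.create_node(body=node)
    # KWOK takes over from here: it fakes kubelet heartbeats and node status,
    # so the scheduler sees 100 "Ready" nodes it can bin-pack onto.
```

Pods placed on these nodes never actually run, so placement decisions and scheduling throughput can be measured in isolation - on a laptop - before touching a real GPU fleet.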

Multi-Cluster Orchestration: Karmada and global scheduling#

Multi-cluster management has evolved into a mature Kubernetes capability, offering practical benefits such as enhanced availability, optimized resource usage, and simplified global deployments. With solutions like Hypershift, ACM, and OCM available, companies now weigh workload efficiency, geographic optimization, and policy and compliance requirements to choose the most suitable orchestration approach for multi-cluster, multi-cloud, and heterogeneous environments. One challenge acknowledged is multi-cluster complexity: just because we can do it doesn’t mean we always should. It introduces new failure modes and consistency issues. But overall, I was impressed: multi-cluster Kubernetes has moved from theory to practice. Seeing a major financial organization like Bloomberg adopt Karmada gives confidence that these tools are ready for enterprise-grade deployments.

While advanced scheduling within a cluster was a hot topic, another dimension is orchestrating multi-cluster environments. As companies grow their infrastructure, they often end up with multiple Kubernetes clusters (for resiliency, geographic distribution, or multi-cloud strategy). That’s where projects like Karmada come in. I attended “Multi-Cluster Orchestration System: Karmada Updates and Use Cases”12 by Hongcai Ren (Huawei) and Joe Nathan Abellard (Bloomberg). This was essentially the “what’s new in Karmada” talk, and it underscored how far multi-cluster management has come.

For context, Karmada13 - short for “Kubernetes Armada” - is a CNCF incubating project that provides a unified control plane to manage multiple Kubernetes clusters. It presents Kubernetes-native APIs to deploy applications that span multiple clusters or fail over between them. Karmada can propagate Kubernetes objects to member clusters based on policies, and it features a multi-cluster scheduler that decides which cluster should run a given workload. The tagline on their docs - “run your cloud-native apps across multiple clusters and clouds, with no changes to the apps” - nicely sums it up.
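As a minimal sketch of that declarative model, the following PropagationPolicy (created against the Karmada API server with the Kubernetes Python client) propagates an existing Deployment named `web` to two member clusters with a weighted replica split. The cluster names, namespace, and kubeconfig context are illustrative assumptions.

```python
# Minimal sketch of Karmada's declarative model: propagate an existing
# Deployment named "web" to two member clusters with a 2:1 replica split.
# Assumes a Karmada control plane and member clusters named "cluster-eu"
# and "cluster-us" (illustrative names).
from kubernetes import client, config

# Point the client at the Karmada API server's kubeconfig context.
config.load_kube_config(context="karmada-apiserver")

policy = {
    "apiVersion": "policy.karmada.io/v1alpha1",
    "kind": "PropagationPolicy",
    "metadata": {"name": "web-propagation", "namespace": "default"},
    "spec": {
        "resourceSelectors": [
            {"apiVersion": "apps/v1", "kind": "Deployment", "name": "web"}
        ],
        "placement": {
            "clusterAffinity": {"clusterNames": ["cluster-eu", "cluster-us"]},
            "replicaScheduling": {
                "replicaSchedulingType": "Divided",
                "replicaDivisionPreference": "Weighted",
                "weightPreference": {
                    "staticWeightList": [
                        {"targetCluster": {"clusterNames": ["cluster-eu"]}, "weight": 2},
                        {"targetCluster": {"clusterNames": ["cluster-us"]}, "weight": 1},
                    ]
                },
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="policy.karmada.io",
    version="v1alpha1",
    namespace="default",
    plural="propagationpolicies",
    body=policy,
)
```

The Deployment itself never changes - only the policy does - which is the “no changes to the apps” promise in practice.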

One challenge of multi-cluster deployments is ensuring that services (especially stateful ones) handle data and communication correctly across clusters. Karmada doesn’t magically replicate databases, but it provides hooks to coordinate with underlying storage or use global load balancers. Karmada can work with external DNS or cloud global load balancers to route traffic to the nearest healthy cluster.

Joe from Bloomberg shared some real-world use cases: reliably propagating resources to a set of clusters, federating model cache resources on GPU nodes across clusters to reduce AI model warm-up time, and stateful application failover and resiliency across multi-cluster, multi-cloud, and heterogeneous Kubernetes environments. In all cases, the benefit is centralized control and policy: operators define deployment policies once, and Karmada takes care of coordinating all the member clusters.

Karmada’s evolution signals that multi-cluster management is maturing. In comparison, Hypershift14, Red Hat Advanced Cluster Management15 (ACM) and Open Cluster Management16 (OCM) also offer multi-cluster orchestration capabilities. ACM provides centralized policy-based management tailored specifically toward enterprise governance, compliance, and lifecycle management. OCM is a community-driven project that employs a hub-spoke model: a central hub cluster manages multiple managed clusters (spokes) by deploying agents on each of them, and these agents facilitate communication and control between the hub and the managed clusters. While Karmada focuses more directly on workload placement, scheduling, and cross-cluster resource optimization, ACM, Hypershift and OCM excel at robust cluster bootstrapping and provisioning, governance, compliance enforcement, and comprehensive policy management. Will these projects converge or remain distinct? Will their communities collaborate?

Event-Driven Autoscaling: Karpenter and KEDA unlock the next level of autoscaling#

The strategic adoption of KEDA and Karpenter offers organizations a compelling path toward improved operational efficiency and substantial cost reduction. As Kubernetes deployments scale in complexity and cost, the ability to dynamically and intelligently manage resources through event-driven autoscaling will increasingly define competitive advantage.

Shifting focus from scheduling to autoscaling, another significant theme at KubeCon was optimizing Kubernetes resource utilization through intelligent, event-driven scaling. The session “KEDA: Unlocking Advanced Event-Driven Scaling for Kubernetes”17 by Zbynek Roubalik and Jorge Turrado emphasized KEDA’s latest features and its growing role in enterprise cost optimization.

Zbynek, one of KEDA’s maintainers, highlighted several new features positioning KEDA as more enterprise-friendly than ever before, including OpenTelemetry integration, admission webhooks for scaling, and an expanded library of scalers. Notably, KEDA now supports HTTP-based scaling, enabling precise, rapid autoscaling of APIs based on real-time request rates - a significant improvement over traditional Kubernetes HPA capabilities.

Crucially, KEDA18 was presented as a robust cost optimization solution. By scaling workloads to zero during idle periods and dynamically responding to real-time demand, KEDA has demonstrated substantial cost savings. An event-driven data pipeline case study showed a remarkable 70% reduction in cloud costs, highlighting KEDA’s practical financial impact. AWS echoed similar figures, presenting scenarios where integrating Karpenter (for node autoscaling) with KEDA (for pod autoscaling) yielded cost reductions of around 70%, capturing attention from both technical teams and finance stakeholders19.
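For a concrete feel of the scale-to-zero pattern, here is a minimal sketch of a KEDA ScaledObject that drives a `worker` Deployment between 0 and 50 replicas based on a Prometheus query. The namespace, Deployment name, metric query, and Prometheus address are illustrative assumptions; only the overall shape of the resource follows KEDA’s API.

```python
# Minimal sketch of KEDA's scale-to-zero behaviour: a ScaledObject that
# drives a Deployment from 0 to 50 replicas based on a Prometheus metric.
# Assumes KEDA is installed and a "worker" Deployment exists; the query
# and Prometheus address are illustrative.
from kubernetes import client, config

config.load_kube_config()

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "worker-scaler", "namespace": "pipelines"},
    "spec": {
        "scaleTargetRef": {"name": "worker"},   # the Deployment KEDA manages
        "minReplicaCount": 0,                   # idle pipelines cost nothing
        "maxReplicaCount": 50,
        "cooldownPeriod": 120,                  # seconds of quiet before scaling to zero
        "triggers": [{
            "type": "prometheus",
            "metadata": {
                "serverAddress": "http://prometheus.monitoring:9090",
                "query": "sum(rate(pipeline_events_total[1m]))",
                "threshold": "100",
            },
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh",
    version="v1alpha1",
    namespace="pipelines",
    plural="scaledobjects",
    body=scaled_object,
)
```

KEDA materializes this as an HPA plus its own activation logic for the 0-to-1 step, which is why it complements rather than replaces the built-in autoscaler.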

Karpenter20, frequently discussed alongside KEDA, has quickly become a mainstream cluster autoscaler. Originally developed by AWS and now a CNCF project, Karpenter intelligently provisions nodes tailored to specific workload requirements, significantly improving over the classic Cluster Autoscaler in speed and flexibility. By proactively launching optimal node types (e.g., GPU or spot instances), Karpenter complements KEDA’s event-driven pod scaling, ensuring seamless, rapid scaling from zero to large clusters without manual intervention.
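On the Karpenter side, a minimal sketch - assuming Karpenter on AWS with an existing EC2NodeClass named `default` - is a NodePool that lets the cluster burst onto spot or on-demand capacity and consolidate back down once pods scale away. Field names follow the `karpenter.sh/v1` API and may differ on older installations; the pool name and limits are illustrative.

```python
# Minimal sketch of a Karpenter NodePool: burst onto spot/on-demand capacity
# and consolidate nodes away when they become empty or underutilized.
# Assumes Karpenter on AWS with an EC2NodeClass named "default" (illustrative);
# field names follow the karpenter.sh/v1 API.
from kubernetes import client, config

config.load_kube_config()

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "burst-pool"},
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                "requirements": [
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": ["spot", "on-demand"]},
                    {"key": "kubernetes.io/arch",
                     "operator": "In", "values": ["amd64"]},
                ],
            }
        },
        "limits": {"cpu": "512"},  # cap how far the pool can grow
        "disruption": {
            # Repack and remove nodes once the pods have been scaled away.
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            "consolidateAfter": "1m",
        },
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=node_pool
)
```

Paired with a ScaledObject like the one above, pods appear on demand and the nodes that hosted them disappear shortly after the work is done.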

The integration of KEDA and Karpenter represents a powerful synergy: KEDA instantly adjusts pod counts based on event-driven signals, while Karpenter swiftly provides the exact infrastructure needed. This combined solution effectively automates resource management, enhancing operational efficiency and drastically reducing cloud expenditures.

Looking ahead, the KEDA community is exploring advanced features such as predictive scaling, which anticipates workload surges based on historical patterns, and deeper integrations with cloud-native event sources. Importantly, KEDA enhances rather than replaces Kubernetes’ built-in Horizontal Pod Autoscaler (HPA), providing a low-risk pathway for adoption in enterprise environments.

A New Kubernetes for New Workloads#

KubeCon + CloudNativeCon Europe 2025 painted a clear picture: Kubernetes is evolving rapidly to support the growing demands of AI, big data, batch processing, and global-scale deployments, all while continuing to harden its core and nurture a vibrant community. From a business perspective, technologies like KEDA and Karpenter offer real-world cost savings (up to 70%) by enabling more intelligent, event-driven scaling. Tools like Volcano, YuniKorn, and Kueue open the door to better batch processing and GPU utilization, turning idle time into productive work. Multi-cluster orchestration with Karmada and governance tools like OCM enhance our ability to meet strict SLAs with geo-redundancy and disaster recovery strategies.

The community has the opportunity to consolidate the orchestration of workloads (e.g., HPC and microservices) into unified environments, simplify operations, and let Kubernetes become the de-facto standard for multi-cloud, multi-cluster, workload-aware orchestration of parallel systems. This not only drives efficiency but also creates room for new offerings - think “AI training as a service,” globally distributed applications, or large-scale CI workloads that previously didn’t fit well in Kubernetes. Investing R&D cycles now into evaluating and contributing to these technologies and scenarios can lead to high rewards.

On a human note, attending KubeCon in person was an energizing reminder of why open source matters - not just for the code we write, but for the community we build. In a world where so many of us are spread across time zones, communicating daily through Slack threads, GMeet calls, and GitHub issues, face-to-face time is rare - and incredibly valuable. Being physically present meant more than just attending sessions; it meant hallway conversations that sparked new ideas, chance meetings that led to future collaborations, and the simple joy of sharing coffee (or a pint) with people I usually only see as profile pictures. Reconnecting with maintainers, contributors, and colleagues gave me renewed perspective on the shared challenges we face and the collective progress we’re making.

I’m especially grateful for the time spent with my teammates - Denis, Prashanth, Sherine, and Andy. Sharing beers and laughs together outside of a scheduled meeting reminded me that trust, creativity, and momentum are all built on real relationships. Those moments of connection will echo in the decisions and direction of our work in the months to come. And having all of this happen in the city I’ve called home for a while now made it all the more meaningful.

One of the most inspiring conversations I had was with elmiko. What started as a technical chat about Kubernetes, Karpenter, and autoscaling quickly unfolded into a wide-ranging discussion on the role of AI in our infrastructure - and in our society. We explored the philosophical dimensions of this new era: what it means to be human in a world increasingly shaped by intelligent systems. Drawing from Marx and Engels’ Manifesto, we reflected on the parallels between the industrial and AI revolutions, and how shifts in labor, automation, and knowledge production may transform the fabric of society.

Conversations like this remind me how important it is for engineers - not just policy makers or ethicists - to actively engage with the ethical and socio-political implications of our work. As builders of the systems that increasingly mediate our world, we carry a responsibility to ask hard questions and explore their broader impact. This becomes even more crucial in open source, where progress is driven by collective, decentralized effort. Our community thrives on shared values and collaboration across borders, and that gives us a unique opportunity - and obligation - to shape technology that serves humanity, not just industry. I hope to continue this kind of dialogue with elmiko and others, perhaps at DevConf.cz in Brno. These are the conversations that stay with you long after the conference ends.

Kubernetes’s second decade is underway - and with its growing ability to run specialized, distributed, and intelligent workloads, it’s poised to become the universal control plane for modern infrastructure. Let’s make sure we’re right there with it. See you at the next KubeCon. 🚀

Bonus talk: KubeCon + CloudNativeCon Europe 2025: Kubernetes and AI To Protect Our Forests: A Cloud Native Infrastructure for Wildfire Prevention - Andrea Giardini, Crossover Engineering BV

Also read: A perspective on the current and future state of Kubernetes scheduling

References#