Part 2 of our kMetal introduction. A deep dive into how kMetal works: hosted control planes, kernel-native KVM isolation, software-defined networking, fleet reconciliation, cluster autoscaling, and multi-site HA.


Multi-tenant Kubernetes on bare metal has three costs: dedicated control plane machines per tenant, a full hypervisor stack installed solely for isolation, and fragmented fleet operations that don't scale. Our new solution, kMetal, eliminates all three in one platform:
1. Hosted Control Planes from Kamaji cut control plane hardware by 60%.
2. Native KVM and OVN software-defined networking replace the separate hypervisor and network products.
3. Profile-based fleet management with continuous reconciliation replaces cluster-by-cluster operations.
A new tenant cluster provisions in 20 seconds. No VMware, no Nutanix, no OpenStack or Proxmox, just kubectl.
In Part 1, we described why we developed kMetal. Several readers asked for more technical details. This post is for them. I'm going to walk through the architecture: what we built, how it works, and why we made the trade-offs we did. I'll try to keep it honest about what kMetal does well and where we deliberately drew the line.
Our open core project Kamaji was already running in production at public cloud providers and enterprises. The hosted control plane problem was solved. But every large deployment taught us the same lesson: eliminating the control plane tax is necessary but not sufficient. Operators were still stuck running a full hypervisor stack for tenant isolation, a separate network product for traffic segregation, and a patchwork of tools for fleet lifecycle management.
Five constraints shaped most of the design:
Hard multi-tenancy on shared hardware. Tenants share physical servers but cannot share control plane, kernels or network domains.
No separate infrastructure stack. No product with its own console, lifecycle, or skill set. Everything through the Kubernetes API.
Fleet-scale lifecycle management. Hundreds to thousands of clusters managed declaratively from a single control plane.
Single API surface. One API for compute, networking, isolation, and cluster lifecycle.
Bare metal as the starting point. No assumption of an existing hypervisor or cloud underneath.
The Hosted Control Plane (HCP) architecture makes everything else economically viable. Without it, "real cluster per tenant" is too expensive at scale. Kamaji moves tenant control plane components into pods on a small set of shared management nodes. If a management node fails, tenant control plane pods reschedule to healthy nodes automatically.
The numbers: Cluster provisioning takes about 20 seconds. Control plane upgrades complete in about 16 seconds via blue/green pod replacement with zero downtime. A 30-tenant deployment that traditionally needs 90 dedicated control plane machines runs on 3 shared nodes, a 97% reduction.
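To make the model concrete: with Kamaji, a tenant control plane is just another declarative Kubernetes object. The sketch below shows the shape of a TenantControlPlane resource; field names and values are simplified and illustrative, so check the Kamaji documentation for the exact schema.

```yaml
# Illustrative sketch of a Kamaji TenantControlPlane.
# The control plane runs as pods on shared management nodes,
# not on dedicated machines.
apiVersion: kamaji.clastix.io/v1alpha1
kind: TenantControlPlane
metadata:
  name: tenant-a
  namespace: tenants
spec:
  controlPlane:
    deployment:
      replicas: 2          # control plane pods, rescheduled automatically on node failure
  kubernetes:
    version: v1.31.0       # each tenant can pin its own version
  networkProfile:
    port: 6443
```

Creating, upgrading, or deleting a tenant control plane is then a matter of applying or editing this one resource, which is what makes the roughly 20-second provisioning time possible.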

Learn more about HCP on the Kamaji website.
The distinction between real clusters and virtual clusters is worth addressing directly, because the industry has started conflating the two.
There's a class of solutions that creates lightweight, namespace-scoped "virtual clusters" inside a host cluster. The tenant gets what looks like a Kubernetes API, but under the hood it's a translation layer on top of shared infrastructure. The tenant's "virtual cluster" is a projection, just a filtered view of the host cluster's resources. Some of these solutions have recently started targeting bare metal too.
This approach is fine for dev/test environments where isolation requirements are low and speed matters more than boundaries. But for production multi-tenancy on bare metal (cloud providers, sovereign deployments, regulated industries, AI factories), it breaks down. And noisy-neighbor problems are structural: tenants compete for the same host resources without hardware-level boundaries.
kMetal tenants get real, upstream, CNCF-compliant Kubernetes clusters. Dedicated control plane. Dedicated datastore. Dedicated kernel. Dedicated network domain. Full cluster-admin access, the real thing, not a "translation". Each tenant can run a different Kubernetes version, different add-ons, different policies. From the tenant's perspective, it's indistinguishable from a cluster they built themselves. The difference is that it provisions in 20 seconds and costs a fraction of what dedicated hardware would.
Multi-tenant Kubernetes on shared hardware needs compute isolation. Namespace boundaries and network policies aren't enough for hard multi-tenancy: a container escape compromises every tenant on the host. Container sandboxes such as gVisor, Kata, or Edera isolate at the pod level, which is more granular than we need and adds per-pod overhead that compounds when tenants run many pods.
We isolate at the tenant level using KubeVirt, the CNCF project that brings KVM virtualization into Kubernetes as a native resource. Each tenant's worker nodes run in their own VMs with dedicated kernels. The isolation boundary matches the trust boundary between tenants, not between individual pods.
Why KubeVirt? KVM is a kernel module, not a product. No separate hypervisor to install, license, or manage — this is what lets us embed it in the platform. KubeVirt makes KVM a Kubernetes-native primitive: VMs are created, scaled, and deleted through the Kubernetes API, just like pods. And the performance overhead is minimal and predictable. From the tenant's perspective, worker nodes are just Kubernetes nodes in `kubectl get nodes`. From the operator's perspective, worker VMs are Kubernetes resources managed through Cluster API. No vCenter, no Prism, no Proxmox, no Horizon.
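Because worker nodes are KubeVirt VirtualMachine resources, they look like any other Kubernetes object to the operator. The sketch below shows roughly what one tenant worker VM looks like; in kMetal these are created through Cluster API rather than by hand, and the image reference is hypothetical.

```yaml
# Illustrative sketch of a KubeVirt VM acting as a tenant worker node.
# In practice, Cluster API templates generate these; values are examples.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: tenant-a-worker-0
  namespace: tenant-a
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        cpu:
          cores: 4
        memory:
          guest: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/example/worker-node:latest   # hypothetical node image
```

Each VM boots its own kernel, so the isolation boundary sits exactly at the trust boundary between tenants.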
Separate kernels aren't enough. Without network isolation, tenants on the same physical network can observe each other's traffic. And Kubernetes network policies won't cut it here: kMetal tenants have cluster-admin access and can modify or delete their own policies. Policies also operate within a shared network stack, so all tenants share the same overlay, DNS namespace, and service discovery.
kMetal by default uses Kube-OVN, built on OVN (Open Virtual Network), to give each tenant a fully isolated network domain at the platform level. Each tenant gets its own Virtual Private Cloud (VPC), its own IP address space (overlapping CIDRs are fine!), and its own DNS namespace. Traffic is segregated, not filtered: separate network planes entirely.
We evaluated every major Kubernetes CNI for this. Most mainstream CNIs are designed around a flat, shared network model. They can filter traffic between tenants using network policies, but they can't truly segregate it. The network is still shared: tenant pods still live in the same overlay, the same IP space, the same DNS namespace. For hard multi-tenancy, filtering is not isolation. OVN is the only network solution that provides real network-level multi-tenancy. This is why telcos and cloud providers have used OVN for years in their infrastructure, and it's why we chose Kube-OVN for kMetal.
The SDN is managed through the Kubernetes API as part of cluster profiles. No separate SDN controller console.
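As an illustration, per-tenant network isolation in Kube-OVN is declared with two custom resources: a Vpc and a Subnet bound to it. The sketch below is simplified, with example names and CIDRs; consult the Kube-OVN documentation for the full schema.

```yaml
# Illustrative sketch: a dedicated network domain for one tenant.
# A second tenant could declare the exact same cidrBlock in its own VPC.
apiVersion: kubeovn.io/v1
kind: Vpc
metadata:
  name: tenant-a-vpc
spec:
  namespaces:
    - tenant-a
---
apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: tenant-a-subnet
spec:
  vpc: tenant-a-vpc
  cidrBlock: 10.0.0.0/16   # overlapping CIDRs across tenants are fine
  protocol: IPv4
  namespaces:
    - tenant-a
```

Because each tenant's traffic lives in its own VPC, segregation is structural rather than enforced by filter rules that a cluster-admin tenant could remove.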
Fleet management is what makes kMetal operational at scale, built on foundations provided by Kamaji and Sveltos:
Cluster Profiles. A profile defines a cluster's target state: version, storage, networking, security policies, mandatory add-ons, resource quotas. Assign a profile to a group of tenant clusters. Change the profile, and all assigned clusters update.
Reconciliation Loop. A continuous loop compares each cluster's actual state to its profile and remediates drift: version mismatches, missing add-ons, drifted policies, expiring certificates. If a tenant user removes a mandatory component, the loop puts it back.
Lifecycle Automation. Provisioning (~20s from profile), blue/green control plane upgrades (~16s, zero downtime), rolling worker upgrades, fleet-wide patching, automated certificate rotation, per-tenant etcd backup. All API-driven, all available through GitOps.
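A profile with continuous reconciliation maps naturally onto a Sveltos ClusterProfile. The sketch below shows the general shape, enforcing one mandatory add-on across every cluster labelled as production; the chart and label values are illustrative, so refer to the Sveltos documentation for the exact schema.

```yaml
# Illustrative sketch of a Sveltos ClusterProfile.
# Drift detection re-installs anything a tenant removes.
apiVersion: config.projectsveltos.io/v1beta1
kind: ClusterProfile
metadata:
  name: production-baseline
spec:
  clusterSelector:
    matchLabels:
      tier: production           # applies to every matching tenant cluster
  syncMode: ContinuousWithDriftDetection
  helmCharts:
    - repositoryURL: https://kyverno.github.io/kyverno
      repositoryName: kyverno
      chartName: kyverno/kyverno
      chartVersion: 3.2.6
      releaseName: kyverno
      releaseNamespace: kyverno
```

Editing this one resource updates every assigned cluster; the reconciliation loop then keeps them converged on the declared state.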
Traditional bare metal Kubernetes means fixed capacity per tenant: operators over-provision for peak load and waste hardware the rest of the time.
kMetal's embedded KVM layer changes this. Because worker nodes are lightweight, the platform can create and destroy them on demand without hardware procurement cycles. The autoscaler watches pending pods and node utilization per tenant. When a tenant needs more compute, new workers are created on bare metal and joined to the cluster. When demand drops, nodes are drained and destroyed, returning capacity to the shared pool.
Operators define min/max node counts per tenant. During off-hours, clusters scale to minimum. Combined with Control Plane sleep mode, an idle tenant consumes near-zero resources. Instead of provisioning each tenant for peak load, you provision for aggregate demand.
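One way to express per-tenant min/max bounds, since workers are Cluster API resources, is the upstream cluster-autoscaler annotations on a MachineDeployment. The sketch below is illustrative (names and sizes are examples, and the spec is truncated), but the annotation keys are the standard Cluster API autoscaler ones.

```yaml
# Illustrative sketch: autoscaling bounds for one tenant's worker pool.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: tenant-a-workers
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
spec:
  clusterName: tenant-a
  replicas: 1
  # remainder of the MachineDeployment spec (selector, template) omitted for brevity
```

The autoscaler then creates or destroys worker VMs within those bounds based on pending pods and utilization, returning freed capacity to the shared pool.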
kMetal supports multi-site HA by stretching both the underlying infrastructure cluster and the tenant clusters across physical sites, so the loss of one site doesn't take down the control plane. etcd data is replicated synchronously across sites using Raft consensus: a write is acknowledged only after it's committed on a majority of nodes. For two-site deployments, a lightweight cloud-based quorum witness handles split-brain prevention without requiring a full third site. Failover is automatic: all the tenant control plane pods reschedule to the surviving site through standard Kubernetes scheduling.
Storage. kMetal is storage-agnostic. Bring any CSI provider: NetApp, PureStorage, Dell, Ceph, whatever you already run. Storage for Kubernetes is almost a solved problem; bundling one would limit integration with existing investments.
Legacy VMs. The KVM layer in kMetal exists for Kubernetes tenant isolation, not for running arbitrary Windows or Linux servers. A general-purpose hypervisor has different requirements (live migration, guest OS compatibility matrices, VM template management) that would dilute the Kubernetes-native operational model.
AI Platforms. kMetal supports GPU passthrough. It doesn't include ML/AI framework management or training job orchestration. It provides the isolated, GPU-capable cluster. What runs on top, e.g. Kubeflow, Ray, RunAI, is the tenant's choice.
The Kamaji Foundation
Kamaji is open source under Apache 2.0 and stays that way. It handles the core cluster lifecycle: tenant control planes as pods, dedicated datastores, network tunneling, Cluster API integration.
kMetal adds what bare metal fleet operations require: embedded compute and network isolation, fleet management with profiles and reconciliation, cluster autoscaling, multi-site HA, self-service, and enterprise support.
Engineers can deploy Kamaji independently to validate the HCP model before evaluating kMetal. Kamaji is the technical proof that the architecture is sound. kMetal is the operational proof that it works at enterprise scale.
If you want to see how this looks in practice, we'd love to walk you through it. Thank you.
---
Adriano Pezzuto is the founder of Clastix. He has spent more than 10 years working on Kubernetes and more than 25 years in the infrastructure space.