An update on hosting Kubernetes on bare metal

Introduction

It has been almost 3 years since I started my journey using Kubernetes on bare metal servers. Work has been busy, and I think it’s time to reflect on this decision (as these decisions take a few years to compound).

So, where are we now? Our cluster has grown:

  • from 96 to ~250 CPU cores (Intel to AMD)
  • from 5 to 6 nodes (preferred to scale vertically instead of horizontally)
  • from 320 to ~700GB of RAM
  • from ~5 Gbps to ~110 Gbps of bandwidth

The scale is obviously linked to customer demand:

  • 2x growth in number of inbound requests per year
  • p99 latency went from ~1500ms to ~950ms
  • p50 latency went from ~800ms to ~350ms
  • peak inbound traffic went from ~0.6k req/s to ~3k req/s

Obviously, not everything is down to infrastructure; we also spent a lot of time optimizing backend services (moving rendering to WASM and Rust, for example).

I will directly state the conclusion: I don’t regret doing it. It has been a great choice for our company and allowed us to survive in difficult times.

Networking

This has been by far the biggest source of issues. As explained before, we do a lot of I/O to deliver the service, even more now than when we started in 2022.

When we started, we opted to route internal traffic with Cilium (using VxLAN) over our encrypted WireGuard link. However, this was a bad idea. The double encapsulation of the traffic (first over UDP for VxLAN, then over UDP for WireGuard) meant a lot of overhead for a TCP connection: WireGuard packets are capped at the 1500-byte link MTU; remove 80 bytes for its headers and we are left with 1420. Add the VxLAN header overhead on top and we are down to 13XX bytes per TCP segment.
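
To make that overhead concrete, here is a rough back-of-the-envelope sketch in Python. The 80-byte WireGuard budget comes from the paragraph above; the ~50-byte VxLAN overhead and the 40 bytes of inner TCP/IP headers are assumptions for a plain IPv4 setup without options:

    # Rough MTU/MSS arithmetic for the double-encapsulation setup.
    LINK_MTU = 1500          # physical link MTU, in bytes
    WIREGUARD_OVERHEAD = 80  # outer headers budgeted for WireGuard (from the text above)
    VXLAN_OVERHEAD = 50      # outer IPv4 + UDP + VxLAN headers (assumed)
    TCP_IP_HEADERS = 40      # inner IPv4 (20) + TCP (20), no options (assumed)

    mtu_wireguard = LINK_MTU - WIREGUARD_OVERHEAD  # 1420: what fits inside WireGuard
    mtu_both = mtu_wireguard - VXLAN_OVERHEAD      # what fits once VxLAN is stacked on top

    print("TCP payload per segment, WireGuard only:   ", mtu_wireguard - TCP_IP_HEADERS)
    print("TCP payload per segment, WireGuard + VxLAN:", mtu_both - TCP_IP_HEADERS)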

At this point, we clearly needed to remove one layer. I chose to drop VxLAN, since WireGuard is used across our whole fleet and not only in the Kubernetes cluster.

The trick was to route a /24 prefix to each Kubernetes node over WireGuard, which allows for 256 IPs per host (enough for our setup; we don’t run more than 256 pods on a given machine).
Cilium allocates pod IPs from a preconfigured pool inside this /24, and we add a route so that each host sends everything in its /24 into the Cilium host network (from which Cilium can redirect to the correct pod).

Simply put, here’s the life of a packet from server A to B (a small sketch of this routing decision follows the list):

  • Pod 10.1.1.1 (pod 1 on server A) sends a packet to 10.1.2.1 (pod 1 on server B)
  • Server A sees that 10.1.2.1 is in the /24 routed to server B (10.1.2.0/24), so it sends the packet there over WireGuard
  • Server B receives the packet for 10.1.2.1; since 10.1.2.1 is in its own /24, it routes the packet to the Cilium virtual interface
  • Cilium then routes the packet to the correct pod’s virtual interface.
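
Here is a minimal Python sketch of that routing decision, assuming the illustrative addressing above (one /24 per node out of a larger pod range; in reality this logic lives in the kernel routing table and the WireGuard allowed IPs, not in application code):

    import ipaddress

    # Illustrative per-node /24 pod prefixes, matching the example above.
    NODE_POD_CIDRS = {
        "server-a": ipaddress.ip_network("10.1.1.0/24"),
        "server-b": ipaddress.ip_network("10.1.2.0/24"),
    }

    def next_hop(current_node: str, dst_ip: str) -> str:
        """Decide where a packet destined to a pod IP goes from `current_node`."""
        dst = ipaddress.ip_address(dst_ip)
        for node, cidr in NODE_POD_CIDRS.items():
            if dst in cidr:
                if node == current_node:
                    # Local /24: hand the packet to Cilium, which delivers it to the pod.
                    return "local -> Cilium -> pod"
                # Another node's /24: send it over the WireGuard link to that node.
                return f"WireGuard -> {node}"
        return "default route (not a pod IP)"

    print(next_hop("server-a", "10.1.2.1"))  # WireGuard -> server-b
    print(next_hop("server-b", "10.1.2.1"))  # local -> Cilium -> pod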

That went well for some time. However, at some point we found that the public bandwidth of those servers was maxed out, which obviously created issues.

Bandwidth capacity can obviously be bought, so that’s what we did: a private network interface maxing out at 25 Gbps per server (up from 2 Gbps previously). However, we didn’t need this private network across the whole fleet, only on the servers that required more bandwidth.

Another cool thing about WireGuard is that, since it’s peer-to-peer, each host reaches every other host through a single configurable endpoint IP. This means we can point the endpoint of each server that sits on the private network to its private IP, and keep the public IP for the servers that are outside of it.

Again, simply put: WireGuard carries traffic over the private network between servers that are on it, and over the public internet for the others, which keeps the topology simple from the point of view of Cilium and Kubernetes in general.
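
As a rough illustration of that endpoint choice (the hostnames, addresses and port below are made up; in practice this is just the endpoint field of each peer in the WireGuard configuration):

    # Choose the WireGuard endpoint for each peer: use the private NIC when both
    # servers sit on the private network, fall back to the public IP otherwise.
    # All names and addresses here are hypothetical.
    PEERS = {
        "node-1": {"public": "203.0.113.10", "private": "192.168.10.1"},
        "node-2": {"public": "203.0.113.11", "private": "192.168.10.2"},
        "node-3": {"public": "203.0.113.12", "private": None},  # not on the private network
    }

    def wg_endpoint(me: str, peer: str, port: int = 51820) -> str:
        """Return the endpoint `me` should use to reach `peer`."""
        both_private = PEERS[me]["private"] and PEERS[peer]["private"]
        ip = PEERS[peer]["private"] if both_private else PEERS[peer]["public"]
        return f"{ip}:{port}"

    print(wg_endpoint("node-1", "node-2"))  # 192.168.10.2:51820 (private link)
    print(wg_endpoint("node-1", "node-3"))  # 203.0.113.12:51820 (public internet)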

While a cloud provider would have had more bandwidth capacity from the start, so we would never have hit this problem, it would also have kept a huge portion of our revenue as bandwidth fees :)

In conclusion, networking is by far the hardest problem to solve when going bare metal, especially when you need performance out of it (which requires a lot of low-level knowledge).

Scaling

One of the issues with going bare metal is that provisioning servers is hardly something you can automate away in a few clicks. You need to order a new server and install everything on it, and do all of that before you actually need the capacity.

However, one of the advantages of bare metal is that capacity is so cheap that you can provision way more than needed, then watch usage and pre-emptively buy new servers once it approaches what I consider near max (80% for me).

Once a server is installed and ready to configure, and since we have used Ansible from the beginning, getting it ready to receive production traffic is actually quite fast. So in the end, I don’t think it’s worth changing our provisioning process even as we scale.

DNS

You’ve got a problem that seems to be network connectivity, but the network is fine? Pretty sure it’s DNS; it’s always DNS. This saying is well known to anyone working with distributed systems, and we didn’t escape it.

We had initially deployed CoreDNS because we wanted a configuration similar to what we had in GKE beforehand. However, scaling CoreDNS is not easy. Fortunately, when we did have issues with it, I remembered that pod-to-service traffic is actually handled by Cilium, which means CoreDNS is only doing name-to-service-IP resolution, and that can be replaced by configuring the service IP directly instead of its DNS name.
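
As a concrete example of that swap, a client can pick up the service’s ClusterIP from the environment variables Kubernetes already injects (for services that exist when the pod starts) instead of resolving the DNS name, which takes CoreDNS out of the hot path. A small sketch, with a hypothetical backend service:

    import os

    # Kubernetes injects <SERVICE>_SERVICE_HOST/_PORT env vars for services that
    # already exist when the pod starts; reading them avoids the DNS lookup that
    # would otherwise go through CoreDNS. "BACKEND" is a hypothetical service name.
    backend_host = os.environ.get("BACKEND_SERVICE_HOST", "backend.default.svc.cluster.local")
    backend_port = os.environ.get("BACKEND_SERVICE_PORT", "8080")

    BASE_URL = f"http://{backend_host}:{backend_port}"
    # Requests to BASE_URL now hit the service IP directly; Cilium translates that
    # virtual IP to a pod IP without any DNS round trip.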

I’m now contemplating removing CoreDNS entirely, as I’m fine relying on public DNS infrastructure for outgoing requests and on Cilium for internal ones.

Ingress

Gloo has been a good choice, though not a great one, because there has been a lot of movement around the native Kubernetes Gateway API, released around v1.26.

Gloo itself is getting deprecated and rebranded as KGateway, which, from what I understood, aims to be pretty much an open implementation of the Gateway API spec. This means some rework on our side to change solutions. Fortunately, Cilium also made this move, so it’s possible we hand that job to Cilium directly too.

One of the features that initially made us choose Gloo was the possibility of having a custom auth server consulted on every request. Sadly, this isn’t covered by the Gateway API right now, so we’ll need to find an alternative solution.
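
For context, such an auth service is conceptually just an endpoint the gateway calls before forwarding each request, allowing or denying it. Here is a toy sketch; the port, header check and protocol details are placeholders, not Gloo’s actual ext-auth contract:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class AuthHandler(BaseHTTPRequestHandler):
        """Toy per-request auth check: 200 means allow, 403 means deny."""

        def do_GET(self):
            token = self.headers.get("Authorization", "")
            allowed = token == "Bearer let-me-in"  # placeholder; a real check would validate a JWT/session
            self.send_response(200 if allowed else 403)
            self.end_headers()

        def log_message(self, *args):  # keep the sketch quiet
            pass

    if __name__ == "__main__":
        # The gateway would be configured to call this service for every incoming request.
        HTTPServer(("0.0.0.0", 9000), AuthHandler).serve_forever()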

Currently, the most likely alternative is migrating every authentication and authorization concern to WorkOS: from a product perspective we want to delegate much of it so we can focus on our core products, so everything is coming together on this topic.

Monitoring

Monitoring is something that is often taken for granted when using cloud providers, because they mostly offer it for free when you pay for compute (they need it themselves anyway, so it makes sense to offer it at ~no cost). We decided at the beginning to use Grafana Cloud and store everything in a hosted fashion.

Soon came the first invoice: we didn’t yet have all the metrics we wanted and the bill was already huge. Clearly, all monitoring providers charge a fortune for metrics. The solution was simple: we already run a lot on bare metal, so why not store the metrics there too? An 8-core server with 64GB of RAM and a 1TB disk is around 100€/month, and our hosted Grafana instance can connect to it, so we still use the hosted instance to visualize everything.

The drawback is having to configure the metrics store correctly ourselves. Initially it was Prometheus with 15 days of retention (which is enough for us), later upgraded to VictoriaMetrics. No issues since.

Security

Here comes the elephant in the room: security. I’m saying this because we are going through our SOC 2 compliance process, and it logically requires a lot more work than it would with a cloud provider.

Handling access, encryption, secrets, and documentation can be almost entirely automated away when using the cloud. We didn’t have that luxury :)

I’m not saying it’s not feasible, of course. We did it and for now it’s fine, but that’s clearly a cost for teams that need to be compliant with those frameworks (whether it’s SOC, ISO, GDPR or your preferred one).

When discussing this with a fellow CTO a few days back, we agreed that if we had known about compliance requirements earlier, we would have built for them from the ground up, and it would have saved a lot of time down the road.

The conclusion: if you think you’ll need to comply with specific frameworks at some point in the future, check all the requirements when building your infrastructure. At the very least you won’t be surprised; at best you can implement things directly from the start.

Conclusion

As I mentioned at the beginning, this was clearly a good choice for us. It allowed us to reduce our costs when we had the time (whereas a lot of companies try to reduce costs at the last moment with only a few months of runway left).

I would say that this was not, and is not, a “never use the cloud” post series. There are companies out there for which this wouldn’t make sense. It did for us because of how our product works and the people who worked here, so again, try to find the correct solution for your case and don’t copy others.

Looking into the future, I’m planning to hire a dedicated SRE to maintain this infrastructure and keep scaling it. My job has evolved a lot since 2022, so while I don’t think I will be the one making future changes, I will still be there to understand them and report back on the learnings along the way!