How hard is it to deploy Kubernetes on bare metal nowadays?
Introduction
Before you start: this post is the second one in my series about choosing bare metal Kubernetes and self-managing it within a small startup, so you might be interested in the first one: Why did we choose to self manage kubernetes in 2022 for a small startup?
So we left off having decided to host Kubernetes on bare metal and self-manage it; now it's time to explain how we actually got it running. I'm gonna split this into the following sections:
- Networking (aka the thing that will generate problems): how will containers communicate between themselves and with the outside world?
- Control plane (aka the stuff that decides what happens in your cluster): deploying the apiserver, scheduler, controller-manager and etcd
- Data plane (aka the thing that will run your workloads on your nodes): configuring containerd & kubelet (without kube-proxy)
- Ingress (aka load balancers): picking something to actually handle traffic from your users
Before continuing, I'm gonna repeat what I said in the first part: all of this is tailored to our needs at Reelevant, and we tried to keep it as simple and dumb as possible (note that I say “tried”; some might think this is already complicated, but that's fine for us).
Even though I'm writing this in September 2022, we worked on this project between November 2021 and March 2022, so some pieces are already out of date with the current possibilities and will be even more so in the future :)
For reference, we wrote the first line of code in early November, finished a first working version of the cluster in early January, and shifted 100% of production traffic onto it at the end of March. I was only part-time on it (dedicating ~10 hours per week on average) and our intern at the time (@Vardiak, who is looking for an internship for Feb 2023!) was pretty much full-time from November until the end of his internship in January.
Last note: I will not share the exact configuration or code we wrote, because the goal of these posts is to share our journey, not to be a tutorial (if folks want to discuss more, reach out directly). Still, here are a few important details:
- We use Ansible to automate the configuration of our servers, because you certainly don't want to manually configure ~10 pieces of software in an emergency (a rough sketch of what a playbook looks like follows this list).
- We provision them manually (choosing the OS, the user & inserting SSH keys) in OVH's UI; after that, Ansible does everything.
- We wrote a few roles (for non-Ansible users, a role is a group of “scripts” that installs/updates a specific piece of software), forked others and re-used open source ones too: the mix is respectively 20%/60%/20%.
- Some of what we did initially wasn't usable for us without further tuning; check out the next article (when it comes out :)) which will explain every change we had to make.
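To make that concrete, here is a minimal sketch of the kind of top-level playbook we end up running; the host groups and role names below are illustrative, not our actual inventory or roles.

```yaml
# Hypothetical top-level playbook: apply roles to the cluster hosts.
# Host groups and role names are placeholders for illustration only.
- hosts: kubernetes_nodes
  become: true
  roles:
    - wireguard                  # bring up the encrypted overlay network
    - containerd                 # container runtime
    - kubelet                    # data plane agent
    - cilium                     # CNI configuration

- hosts: control_plane
  become: true
  roles:
    - etcd
    - kube_apiserver
    - kube_scheduler
    - kube_controller_manager
```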
Just before diving in, I want to thank Robert Wimmer aka githubixx; his blog is a gold mine for anyone wanting to deploy/understand Kubernetes: he explains how he deployed his bare metal clusters, which technologies he chose and why, how to upgrade components, etc. Plus, the Ansible roles he wrote helped us a lot. Even though we had to diverge from his configuration for our needs, it was a huge head start for this project.
Networking
This is the most important choice you need to make when deploying a bare metal cluster: how will containers communicate? In Kubernetes this is handled by the Container Network Interface (referred to as CNI from now on). There are a lot of CNIs you can pick from, the well-known ones being:
- Calico
- Cilium
- Flannel
At the time (and I still think it is), the coolest kid on the block was Cilium because of its underlying use of eBPF. I will try not to go into what eBPF is, but the most important thing to remember is that it's a relatively new (it was declared stable in 2016) and game-changing technology for many networking use cases. Note that other CNIs (including Calico, …) can now be configured to use eBPF too.
I encourage you to read their documentation which is exceptionally well written.
Anyway, Cilium was new, heavily adopted (GKE announced their intent to use it a few weeks prior, and more have joined since), and offered a lot of interesting features (ordered by importance for us):
- Simplification: the possibility of removing kube-proxy, the piece of a vanilla Kubernetes deployment that routes service traffic to pods. This matters because we needed a CNI anyway, so if it could also remove one component to configure/manage, that was a nice gain (see the Helm values sketch after this list).
- Performance: while vanilla kube-proxy uses iptables, Cilium's use of eBPF works at a lower level (avoiding unnecessary compute) and is more optimized (read about the key differences in their docs).
- Observability: as some may point out, I didn't include observability in this post because I will write a dedicated one later. Anyway, Cilium lets you monitor HTTP traffic and extract metrics from it, which is hugely important for understanding how the system behaves (and it doesn't require a classic service mesh).
- Future possibilities:
- Cross-cluster load balancing: as we currently only serve traffic from Europe, latency is fine; however, in the future we might want to deploy other clusters to better serve other regions like Asia or North America, so having the possibility was a plus.
- Support for multiple IP address management strategies: in the cloud most IP addressing is automatic, but on bare metal it can be a bit more complex, so having several strategies to choose from is helpful.
- Support for network policy rules to limit traffic between services that shouldn't talk to each other, or to the internet in general.
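For reference, here's a minimal sketch of the kind of Helm values involved in running Cilium without kube-proxy. The value names are taken from the Cilium Helm chart, but the exact keys and accepted values depend on the chart version, and the apiserver address is a placeholder.

```yaml
# Illustrative Cilium Helm values (keys vary by chart version, check the docs).
kubeProxyReplacement: strict   # let Cilium's eBPF datapath handle service routing
k8sServiceHost: 10.0.0.10      # placeholder: apiserver address reachable over WireGuard
k8sServicePort: 6443
```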
Okay, so now our containers should be able to reach Kubernetes services/pods via Cilium. However, the elephant in the room is the underlying network that will route this traffic. For this we (initially; wait for the 3rd post for more) decided to rely on the “public” network to communicate between nodes; “public” network here means OVH's internet infrastructure (between their servers and datacenters).
While it's tempting to consider the job done there, we still need to be sure that our traffic is encrypted so no one can read our data or tamper with it. There are multiple ways of addressing this:
- Using HTTPS; however, it has drawbacks at this scale: distributing and managing certificates for every pod, which is not easy at all.
- Using a service mesh (like recent Cilium versions support, or others like Linkerd) could work, but in our experience a service mesh is still complex to operate.
- Using kernel-level software, generally choosing between IPsec, OpenVPN or (what we chose here) WireGuard.
WireGuard is, at its root, simple software: take a configuration with the public keys of the servers you want to communicate with and the private key of the current server, bring it up as a dedicated interface, and boom, you have an interface that encrypts/decrypts packets on the fly. This is the technology behind a lot of VPN providers nowadays (like Tailscale).
Now, you can configure it with Ansible and get an encrypted virtual network on top of any internet infrastructure, which you can safely use. Note that even though WireGuard or IPsec can be configured through Cilium, some servers (i.e. databases and the control plane) will not be nodes of the Kubernetes cluster, so we configure it ourselves on all of our servers.
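To make that concrete, here's roughly what a WireGuard configuration looks like; the keys, IPs and interface name are placeholders, and in practice Ansible templates one of these per server.

```ini
# /etc/wireguard/wg0.conf — illustrative sketch, keys and IPs are placeholders.
[Interface]
PrivateKey = <this server's private key>
Address = 10.0.0.1/24          # this server's address on the encrypted overlay
ListenPort = 51820

# One [Peer] section per other server in the fleet.
[Peer]
PublicKey = <other server's public key>
Endpoint = <other server's public IP>:51820
AllowedIPs = 10.0.0.2/32       # overlay address reachable through this peer
```

Bring the interface up with wg-quick (or a systemd unit) and the overlay addresses become reachable, encrypted end to end.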
To summarize:
- We use WireGuard to create a virtual network on top of the internet so our servers can communicate securely.
- We use Cilium to allow pods to reach each other and Kubernetes services.
Now that we have a working network (in theory), we can move forward and start configuring the control plane that will orchestrate our Kubernetes cluster.
Kubernetes control plane
The control plane (as opposed to the data plane) is composed of different components that each do one important thing to make the cluster work:
- etcd is responsible for storing data (you can see it as the Kubernetes database) like pods, deployments, etc.
- kube-apiserver is just the API that everyone interacts with to create, update or delete the resources stored in etcd.
- kube-scheduler is responsible for finding an available node to run pending pods, taking into account resource requests, affinity, etc.
- kube-controller-manager is responsible for watching the state of different resources and responding with the correct action: is the CPU usage of this app above 80%? Okay, then I need to create one more pod.
The most important component here is etcd, because if it goes down nothing can change anymore: existing pods will stay there, but nothing new will be created, and even if a node disappears your load balancers will not be aware of it, so they will keep sending requests to it!
I will not really go into the details of how to deploy etcd, as you can find dedicated resources online, but I will explain our setup:
- We deploy a fleet of virtual servers that each run all of those components, meaning each one has one etcd member, one apiserver, one scheduler and one controller-manager. The goal here is high availability: if one server goes down, there are still plenty more to keep the service running.
- Since etcd needs a quorum to operate, we currently have 5 servers active. To tolerate N servers going down, you always need N*2+1 servers to maintain quorum: if I want to be fine with one server going down, I need 1*2+1 = 3 servers. We want to be fine with 2 servers going down, hence 5 servers (a sketch of what each etcd member needs to know about its peers follows this list).
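As an illustration, this is roughly what an etcd member's configuration file looks like; names, IPs and the 3-member cluster list are placeholders (ours has 5 members plus TLS settings on top).

```yaml
# Illustrative etcd config file (passed via --config-file); values are placeholders.
name: etcd-1
data-dir: /var/lib/etcd
listen-peer-urls: https://10.0.0.11:2380
listen-client-urls: https://10.0.0.11:2379
initial-advertise-peer-urls: https://10.0.0.11:2380
advertise-client-urls: https://10.0.0.11:2379
initial-cluster-state: new
initial-cluster: etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380,etcd-3=https://10.0.0.13:2380
```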
And that's pretty much it; these components are quite simple to configure. You just need to be sure they are indeed highly available after deploying (try restarting a server, for example).
Quick note about security: you should only allow your data plane servers to reach those components. In our case we use WireGuard to avoid exposing them to the internet.
Kubernetes data plane
The data plane is even simpler than the control plane; each component is installed on the hosts where we want to run our workloads:
- kubelet is responsible for retrieving the pods that need to run on its node and interacting with the container runtime to run them (a minimal kubelet config sketch follows this list).
- kube-proxy would normally be responsible for configuring each node so that traffic to pods and services on other nodes reaches its correct destination; however, Cilium handles that for us with eBPF, so we don't need it :)
- containerd (our container runtime) is responsible for actually pulling container images and starting them on the host. Note that there are other runtimes (e.g. podman or CRI-O), but I went for the most used one (containerd is what Docker uses behind the scenes).
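For flavor, here's a minimal sketch of the KubeletConfiguration file the kubelet is pointed at; the DNS IP is a placeholder and the exact fields we set differ (some options are also version-dependent).

```yaml
# Illustrative KubeletConfiguration (passed via kubelet --config), not our exact settings.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd            # must match the container runtime's cgroup driver
clusterDomain: cluster.local
clusterDNS:
  - 10.32.0.10                   # placeholder: the CoreDNS service IP
failSwapOn: true
```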
At this point you should be able to create deployments on your cluster and they should run!
But wait a minute: if you create multiple Kubernetes services, you'll see that you have no way to reach them by DNS internally, because we haven't configured a DNS server.
For this mission we used CoreDNS, which is the de facto standard for DNS in Kubernetes and quite simple to deploy with Helm.
Now your services can talk to each other, nice! Buuuut they can't autoscale, because resource usage is not showing up; for that you'll need metrics-server, which gathers metrics and makes them available to other components in the cluster.
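Once metrics-server is up, the kind of autoscaling rule mentioned earlier (add a pod when CPU goes above 80%) looks like this; the deployment name and replica bounds are placeholders, and on older clusters the apiVersion may be autoscaling/v2beta2.

```yaml
# Illustrative HorizontalPodAutoscaler: target ~80% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api                  # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api                # placeholder deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```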
Okay! DNS, check. HPA, check. Now we need a way to reach our workloads from the internet so that our frontend, or whoever needs our APIs, can use them.
Ingress
In the cloud you just ask for a load balancer and after a few minutes you get an IP you can send requests to; obviously it's not that simple for us.
First, we need to decide which hosts we'll send our requests to. We decided to use the same nodes that run our workloads, for a few reasons:
- As detailed in the first post of the series, we are time-sensitive, so if we can reduce latency by hosting as much as possible on the same node, we take it!
- We can more easily configure topology-aware hints so that requests hitting a load balancer in a region stay in that region, for better performance.
- More availability: we don't want to maintain a separate pool of nodes for load balancers and make sure there are always enough of them.
Also, we weren't fans of the one-Ingress-per-API design that was the standard at the time for K8s; we were much more interested in a gateway that would be able to route wherever we want and perform other things in between (later on, the Gateway API picked up a lot of interest from the community and is now in the latest Kubernetes versions).
There were a few projects that could satisfy us (Kong and Ambassador among them), but we settled on Gloo for the following reasons (a rough sketch of how routing looks in Gloo follows this list):
- It uses Envoy to route traffic, which is battle-tested (especially useful for monitoring later on).
- It supports WebAssembly plugins if at some point we want to do heavy compute for routing.
- It allows traffic shadowing (put simply, sending a request both to its original service and to another, the shadow service, in order to test it).
- It supports a custom auth server, which we needed to route traffic to our different microservices. The goal was to transparently hit our auth server to validate the request and then forward it to the correct service.
- The team behind it feels much more committed to open source (even though they have an enterprise product) than Kong or Ambassador.
- We liked the “simplicity” of the architecture.
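To give an idea of how routing is expressed, here is a rough sketch of a Gloo Edge VirtualService; the domain, upstream name and namespace are placeholders, and the exact fields depend on the Gloo version, so double-check their docs.

```yaml
# Illustrative Gloo Edge VirtualService: route api.example.com/* to a single upstream.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: public-api              # placeholder
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - api.example.com         # placeholder domain
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              name: default-my-api-8080   # placeholder upstream reference
              namespace: gloo-system
```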
Not directly related to the ingress technology but generally linked: we obviously need TLS certificates to serve HTTPS, which we generate using the standard tool in the community: cert-manager.
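For completeness, the kind of issuer cert-manager relies on is a small resource like the one below; the email and solver are placeholders and our actual setup differs.

```yaml
# Illustrative cert-manager ClusterIssuer using Let's Encrypt's ACME endpoint.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com               # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: gloo                  # placeholder: depends on your ingress setup
```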
Deploying & configuring it is quite easy with Helm so nothing to add here !
Conclusion
At this point we have a cluster that can receive traffic from users, reach internal services and, obviously, run the workloads to answer them.
Although this is all in theory: we did the research for each solution independently and then started the work of deploying them. You'll see in the 3rd part of the series which parts didn't live up to expectations (spoiler: everything worked out of the box; I'm mostly referring to the performance improvements needed for our use case).
You might think I'm a fool for not talking about security and monitoring, which are both huge topics in themselves.
- Don't worry, we have some security measures in place; however, they aren't really interesting to talk about for now. That's an ongoing project for us, so I'll write a dedicated post when we have something to share.
- Monitoring is obviously important to me because of my involvement in OpenTelemetry; that's why I'm planning a separate post to explain our setup.
That concludes the second part of the series; please feel free to reach out if you have any questions or suggestions.
You can subscribe on Substack if you want to know when the next post of the series (or anything else) comes out!