Why did we choose to self-manage Kubernetes in 2022 for a small startup?

Introduction

I mean, it’s a perfectly normal question to ask yourself when you hear that a company with fewer than 15 people and only 4 engineers decided to go down the “nightmare” of managing Kubernetes themselves.
Just look at the last month’s worth of posts mentioning Kubernetes on Hacker News:

Even though each author tries to be balanced about using Kubernetes (i.e. not saying you should never use it), I believe these posts still create a general sentiment on the internet that Kubernetes is just too complex and that you should think twice before running your software on it.

Something I think we can all agree on is that you should always think twice before committing to a specific tool or technology for the business you are responsible for, since it has a huge compounding impact on it.

My goal with this post isn’t to explain why you should or shouldn’t use Kubernetes, but rather to give one more data point on this decision: why did we choose to self-manage Kubernetes in 2022 for a small startup?

Context

Obviously this decision is highly linked to the problems we had encountered before, the resources we had, and generally how much we wanted to commit to this project. My goal here is to give you a high-level view of all that so you can hopefully see the motivation behind the decision.
Sadly for you, that involves a bit of background about my previous role and what we do at Reelevant (where I currently work). I don’t want this to be an ad, but I think it’s important to have a complete picture.

I’ll start by saying that before joining this startup, I worked at Keymetrics (TL;DR: a monitoring product for Node.js), where I maintained both the backend and the infrastructure, which was hosted on bare metal (deployments were as simple as SSH’ing into each server, running git pull and restarting the apps). We did later migrate to a container-based system using Nomad, Consul and Vault (commonly called the HashiStack), which we ran for a few months without issues (before I left for unrelated reasons).

That gave me at least some confidence that I had the basic skills to manage bare metal servers. That isn’t Kubernetes by itself, but it meant I knew a thing or two about server provisioning/management, networking and the different kinds of failures you can encounter. Looking back, I’m pretty sure that if I hadn’t worked there, you would never have read this post!

Now back to Reelevant, which I joined in 2019 as a DevOps engineer. At the time they were already using GCP’s managed Kubernetes offering, GKE (widely regarded as the most advanced of the managed products), with some success.
However, the cost was quite significant for our size (though nothing dangerous), so naturally they wanted to reduce it if possible. At this point some might say we could simply have tuned our GKE config (node type/count), but let me explain a bit of what we do so you can understand our problems.

Reelevant allows our customers to generate content (generally an image inside an email) for a given client (a customer of our customers) depending on the data they have about them. A few examples:

  • Decathlon uses us to display product recommendations from their own AI team inside their emails. They have algorithms to decide which products to show, but they didn’t have the technology to generate a different visual for each recipient in their emails.
  • Clarins has a loyalty program they want to showcase in their emails, but depending on the client they offer different coupons or discounts, which they couldn’t do without designing and sending separate emails. We allow them to send the same email and change the visual depending on the client.

I won’t go through the whole list (as I don’t want this to be an ad); we have a marketing website that showcases more use cases if you are interested. Anyway, there are two things to note here:

  • We can generate a visual PER client (whether you have 4 of them or 20 million), although sometimes the same content is shown to different clients.
  • It’s generally integrated into marketing campaigns, which are sent at a precise hour that is not predictable for us.

Generating a text API response is much lighter (less than 5 kB) than the images we generate (between 30 and 400 kB), not to mention videos or GIFs (between 500 kB and 15 MB): network costs are huge, roughly 30-35% of our total bill.

Secondly, we needed to autoscale fast: when a customer sends an email to >5M clients, at least 200k of them will open it as soon as it reaches their device (even more since Apple’s privacy changes in iOS 15), which can make our traffic jump from ~50 to 1000 req/s (even though we have a cache, we still mostly make one or two unique queries per user).
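To give a rough sense of scale, here is a quick back-of-envelope sketch using the figures above; the average payload sizes and the burst window are illustrative assumptions, not exact production metrics:

```python
# Back-of-envelope sizing using the illustrative numbers from this post.
# The averages and the burst window are assumptions, not exact production metrics.

recipients = 5_000_000        # one large campaign
avg_image_kb = 200            # somewhere in the 30-400 kB range mentioned above
avg_text_kb = 5               # a plain text/JSON API response

# Egress if every recipient loads one generated image vs one text response
image_egress_gb = recipients * avg_image_kb / 1_000_000
text_egress_gb = recipients * avg_text_kb / 1_000_000
print(f"~{image_egress_gb:,.0f} GB of images vs ~{text_egress_gb:,.0f} GB of text per campaign")

# Burst: ~200k opens concentrated right after the send
early_opens = 200_000
burst_window_s = 5 * 60       # assume most of those land within ~5 minutes
print(f"~{early_opens / burst_window_s:,.0f} req/s during the burst window")
```

Roughly a terabyte of image egress per large campaign, versus a few dozen gigabytes if we only served text, is why bandwidth pricing matters so much to us.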

So to conclude, what are our most important problems?

  • Networking cost (which scales linearly with our business)
  • Burst performance (which depends on our biggest customers)

One small but important detail before we continue: we already had two systems hosted on self-managed bare metal servers at OVH, Pinot and Pulsar, which had been running (mostly) smoothly for the previous 8 months at that point.

The Quest

Naturally, the first thing we did wasn’t to ditch GKE and go all in on self-managed bare metal servers; we did what everyone would do: make adjustments and then refactor some of our backend services, hoping the problem would go away.
We increased cache TTLs where possible, profiled code, reduced payload sizes, etc., but as you can probably guess it wasn’t enough for us (I’m not going to detail each of these to avoid making this post even longer).

What we were looking for:

  1. Find a provider where we don’t pay for network bandwidth
  2. Have hardware that can sustain CPU bursts for a few minutes
    1. By this, I mean we need to avoid virtualized instances, since the hypervisor limits how close we can get to 100% usage of a CPU (a quick way to observe this is the steal-time check sketched right after this list).
  3. Still use our container images, since our whole CD pipeline is built around them
  4. Good quality for the price
  5. Redundancy across multiple regions (yes, it moved even higher up the list after the fire at an OVH datacenter)
  6. Hardware located in Europe, preferably in France (we are based in Paris, and I’m pretty sure you all know about GDPR)
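As a side note on requirement 2, here is a minimal sketch of the check I’m referring to: on a virtualized instance, the steal counter in /proc/stat counts the time the hypervisor gave your vCPU’s physical core to another guest, which is exactly what eats into your ability to burst. This is only an illustration, not a tool from our actual setup.

```python
# Minimal sketch: read CPU "steal" time from /proc/stat on a Linux host.
# On bare metal this stays at (or very near) zero; on a busy virtualized
# instance it grows, which is the hypervisor limiting your CPU bursts.
# Illustrative only, not something from our actual setup.
import time

def cpu_steal_fraction(interval: float = 1.0) -> float:
    def read():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        # Field order: user nice system idle iowait irq softirq steal ...
        return fields[7], sum(fields)

    steal1, total1 = read()
    time.sleep(interval)
    steal2, total2 = read()
    return (steal2 - steal1) / max(total2 - total1, 1)

if __name__ == "__main__":
    print(f"CPU steal over 1s: {cpu_steal_fraction() * 100:.2f}%")
```

On a dedicated server this number stays at essentially zero, which is the whole point of the requirement.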

Okay, so we practically eliminated most of the big players (GCP, AWS, Azure, etc.) with our first requirement. Let’s see what we had left:

  • Hetzner
  • Scaleway
  • OVH
  • Note that we didn’t go further into researching smaller providers, not because of any requirement but because I found what I was looking for before needing to.

All of the above providers have a managed Kubernetes offering, with varying degrees of support for the common features you would expect. However, most of them use VMs to power those managed services (which makes sense for them, to isolate each workload), which would almost certainly mean being limited for bursting (see the second requirement), so we abandoned that idea.

We looked into bare metal servers that we could potentially connect to a managed control plane to simplify our operations, and surprisingly (because I initially didn’t think many people had this problem) both OVH and Scaleway had beta products for exactly this. But we didn’t want to risk betting on beta-quality software that was too early (not that it was scary, but again, we are only 4 engineers, of which I was the only one who would be responsible for the infrastructure) and find ourselves stuck with problems nobody else has.

So we ended up concluding that we had to manage our own bare metal servers for both the control and data planes; how to run our workloads on them was still an open question.
Unsurprisingly, we went with Kubernetes, for the following reasons:

  • We were already deploying on Kubernetes, so there would be little change to our CD pipeline (we were using ArgoCD).
  • Even though Nomad is a great product (as I said before, I had already used it in the past), its ecosystem is much smaller. We also had to factor in that finding engineers with Nomad experience might be harder than finding ones with Kubernetes experience (which is already complicated enough), so Kubernetes was preferable from an HR perspective (which obviously people can argue against).

Okay, so here’s how our quest ends:

  • We would go with OVH servers, specifically their Advance-3 line, which features an AMD Ryzen 9 5900X (12 cores / 24 threads) with 64 GB of RAM and gives us the burst capacity we need compared to the usual Xeon-based options (the Ryzen has much better multi-core performance).
    • Note that to satisfy requirement 5, we would split the servers between two OVH regions, Roubaix and Gravelines (~100 km apart), which should provide redundancy without sacrificing too much latency.
  • We would deploy our control plane ourselves. We settled on VPS instances (which are virtualized rather than bare metal, but we didn’t see the need for more) with 4 CPUs and 16 GB of RAM, which should be plenty for the Kubernetes control plane and etcd (one etcd caveat is sketched right below).
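One caveat worth flagging when the control plane spans only two regions is etcd quorum: etcd needs a strict majority of its members up to accept writes. The snippet below is just the standard quorum arithmetic with hypothetical member counts, not a description of our actual topology.

```python
# Standard etcd quorum arithmetic: with n voting members, writes need
# floor(n/2) + 1 members up, so the cluster tolerates floor((n-1)/2) failures.
# The member counts here are hypothetical examples, not our real topology.
def quorum(members: int) -> int:
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    return (members - 1) // 2

for members in (1, 3, 5):
    print(f"{members} etcd members -> quorum {quorum(members)}, "
          f"tolerates {tolerated_failures(members)} failure(s)")
```

With only two regions, a majority of members necessarily sits on one side, so the minority side can’t serve writes alone if the other region goes down; it’s a trade-off to keep in mind rather than a blocker.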

Conclusion

As I hope you can conclude yourself, we wouldn’t have gone this route if I hadn’t had previous experience running bare metal servers and dealing with the problems that come with them (especially networking). On top of that, we had really specific requirements because of the business we are in, which don’t apply to everyone, so again, please take your own needs and resources into account when picking a technology!

Obviously this is not even half the story; picking a technology is the easy part, implementing it is where all the fun begins :)

To keep each article light (and since I don’t have much experience in writing), I’ve decided to split this into 3 different articles:

  1. Why did we choose to self-manage Kubernetes in 2022 for a small startup? (This one.)
  2. How hard is it to deploy Kubernetes on bare metal nowadays?
  3. Is managing Kubernetes really a nightmare?

You can subscribe on Substack if you want to know when the next post in the series (or anything else) comes out!
Obviously some folks would have done things differently (after all, we are on the internet), so please reach out if you have feedback (about the writing or especially about the choices we’ve made)!