K3sup – bootstrap K3s over SSH in < 60s (github.com)
I have one. And it's managed. I don't think there are significant cost savings in going unmanaged, but maybe. Even so, why would I need a ton of them?
You can’t use cloud stuff on-prem, e.g. when your clients have a server room of their own. Same for a homelab.
It’s also nice not to carry the pets attitude over from servers to clusters, and instead treat everything as cattle. Provided you have backups of persistent data, the config versioned in a Git repo and maybe some Ansible in the mix, being able to recreate an environment after a fuckup is nice, and it also helps against bit rot.
Disclaimer: I actually prefer Docker Swarm/Compose over K8s due to its simplicity (which matches my deployments and scale), but in the cases where I had to use some flavor of K8s, going with K3s was pretty okay.
The cluster isn't that hard to recreate if things go south. Everything is in YAML configs already. And since mine is managed, it's just a few clicks in DigitalOcean to create a new cluster. I think I can create clusters through their CLI too, so if I did want to automate it, it's ready to go. So I'd say I'm cattle-ready, but too cheap to pay for more cattle.
I nearly went with Docker Compose/Swarm by accident when I was just getting started. I knew I wanted to dockerize my app, but then couldn't figure out how to get it into prod. Then I found out that people don't really seem to use Docker Compose for prod, and eventually I stumbled into Kubernetes. It took a few weeks to wrap my head around, but I'm happy with it now.
Once you have a nice setup, I'd say it's pretty simple to maintain. DevSpace is fantastic for development, and for deployment I just wrote a little script which builds my images, updates the kustomization with the new images and applies the manifests. Pretty simple.
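For anyone wondering what such a script might look like, here is a minimal sketch in Python. This is not the commenter's script: the registry, image names and overlay directory are hypothetical, and pushing to a registry is an assumption (the cluster has to pull the images from somewhere). The `docker build`, `docker push`, `kustomize edit set image` and `kubectl apply -k` commands are standard; by default the script only prints them.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Image:
    name: str     # image name as referenced in the manifests (hypothetical)
    repo: str     # registry repository to build and push to (hypothetical)
    context: str  # Docker build context directory


def deploy_commands(images, tag, overlay_dir):
    """Return (argv, cwd) pairs in the order they should run."""
    cmds = []
    for img in images:
        ref = f"{img.repo}:{tag}"
        cmds.append((["docker", "build", "-t", ref, img.context], None))
        # Assumption: the cluster pulls from a registry, so push after building.
        cmds.append((["docker", "push", ref], None))
        # Rewrites the images: list in overlay_dir/kustomization.yaml.
        cmds.append(
            (["kustomize", "edit", "set", "image", f"{img.name}={ref}"], overlay_dir)
        )
    # Render the overlay and apply the resulting manifests in one go.
    cmds.append((["kubectl", "apply", "-k", overlay_dir], None))
    return cmds


def deploy(images, tag, overlay_dir, dry_run=True):
    """Run (or, when dry_run is set, just print) the deploy steps in order."""
    for argv, cwd in deploy_commands(images, tag, overlay_dir):
        print(("[dry-run] " if dry_run else "") + " ".join(argv))
        if not dry_run:
            # check=True aborts the deploy on the first failing step.
            subprocess.run(argv, cwd=cwd, check=True)
```

Usage might look like `deploy([Image("web", "registry.example.com/web", "./web")], tag="abc123", overlay_dir="k8s/overlays/prod", dry_run=False)`, using e.g. the current Git commit hash as the tag so each deploy can be traced back to a commit.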
We couldn't use managed clusters because these were running on our own hardware, and they needed to run on the same infrastructure as the CDN itself.
The point of the local clusters was for workloads that needed to run in each data center, and then multiple clusters in the back office for both compliance and operational reasons.
I would not have used a tool like this, though. We used Rancher to manage our clusters.
I might have to look more into that... not very keen on Oracle though.
The only piece that's maybe a little dicey is the single load balancer/gateway. If there's a hiccup in that, then everything goes down.
But I've only blown up my cluster once in like 8 years or something, that's not too bad. It was a learning experience :-)
What other kinds of isolation do you want? I can see maybe a separate staging environment if you want to test gnarly things like that before rolling them out to prod. And I guess workloads can eat each other's resources if you have neither resource requests/limits nor autoscaling enabled.
But I'm cheap, and managing more clusters sounds like a pain. Then I'd have to deal with more kubectl credentials and whatnot too.
In the cloud or on-prem I suspect folks are having better luck than I did, but I'm also open to being wrong about this.
It's a bit of a mindset change, but essentially whenever you feel the urge to make such a tweak... you've strayed off the golden path and are attempting to do something the wrong way (in Talos world).
I came from k3s, so I was very used to the whole tweaking spiel too.
Where you do need a custom config, you pipe in patch commands rather than modifying the OS, i.e. every change is fed in via the API so that it can be scripted repeatably. The Talos OS is immutable.
It's similar to how you'd control a k8s cluster with kubectl... except you're applying that model at the OS level. You control it by sending API commands, not by modifying settings in files. So you don't "tweak" anything. It's a bit of a mindset shift, I know.
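The "send patches, don't edit files" model can be illustrated with a small sketch. This is not Talos code: it implements only the object-member subset of RFC 6902 JSON Patch (`add` and `replace`; no arrays, no `~` escaping), and the config keys are illustrative. The point is that the change lives in a patch you can version and replay, while the base config is never mutated in place. In practice you would hand a patch like this to Talos through its API (e.g. via `talosctl`) rather than apply it yourself.

```python
import copy


def apply_patch(config, ops):
    """Apply "add"/"replace" ops (object members only) to a deep copy of config.

    Works on a copy and raises on the first bad op, so a failing patch
    never leaves the original half-modified.
    """
    result = copy.deepcopy(config)
    for op in ops:
        if op["op"] not in ("add", "replace"):
            raise ValueError(f"unsupported op: {op['op']}")
        *parents, leaf = op["path"].lstrip("/").split("/")
        node = result
        for key in parents:
            node = node[key]  # like RFC 6902, the parent must already exist
        if op["op"] == "replace" and leaf not in node:
            raise KeyError(f"replace target does not exist: {op['path']}")
        node[leaf] = copy.deepcopy(op["value"])
    return result


# Illustrative machine-config fragment and patch (values are made up).
CONFIG = {"machine": {"network": {"hostname": "talos-abc"}}}
PATCH = [
    {"op": "replace", "path": "/machine/network/hostname", "value": "worker-1"},
    {"op": "add", "path": "/machine/network/nameservers", "value": ["10.0.0.53"]},
]
```

Calling `apply_patch(CONFIG, PATCH)` returns the patched copy, and running it again on the same base gives the same result, which is what makes this style repeatable from a script.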
Also, fun fact, k3sup is pronounced "ketchup" according to the README[0]
[0]: https://github.com/alexellis/k3sup/blob/master/README.md
1) Installing k3s is just one of the things you want to do with a fresh server, so you can have everything bundled as an Ansible playbook, and k3s will just be one step in it.
2) Often you want your infra as code, and reproducible.
It's a cool project, but I didn't think the K3s part was the hard part.
We have been running into a lot of issues with k3s in production, so I embarked on a journey to write a Kubernetes-compliant, equivalent platform in Rust with the help of Claude [1]. It is a fun little project for now and we're still figuring things out. The idea is to keep it minimal, as a single binary with everything embedded including the CNI, and to support various runtimes like Docker and containerd, but also Wasm, VMs and the JVM.
With all respect, "building it because I want to" and "working toward making (it) production grade" doesn't inspire a ton of confidence. k3s has been part of the CNCF for many years and its developer Darren Shepherd was the founding CTO for both cloud.com and Rancher Labs, which were acquired by Citrix and SUSE. It looks like you're running your own B2B company and hoping to swap out k3s as the underlying engine for multitenancy. That's very risky. Surely Claude can help you understand and use k3s just as readily as help you write a replacement, and I'm sure SUSE sells professional services. I have no clue what they charge but typically you're talking like $300 an hour and you'd probably only need 40 hours.
Now that I've embarked on the journey of building this from scratch, there are new, innovative ideas I can implement without being bound to any foundation or org.
PS: We don't sell it as a product; it is 100% free and open source under the MIT license.
Architecturally, where do you run Postgres? I assume it would be external to the cluster? (Running it inside would create a circular dependency?)
There were many issues. Off the top of my head: after a DR drill in which a VM was booted, the node did not join the cluster. Apart from that, a bunch of issues due to etcd and Longhorn.
Another major one was the CNI stopping working on a particular node. Image garbage collection was another: we labelled the images, but it would still remove them from the node.
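For context on the image GC point: as far as I understand it, kubelet's image garbage collection is driven by disk usage and last-use time, not by image labels. Once usage reaches `imageGCHighThresholdPercent` (85 by default), it deletes unused images, least recently used first, until usage drops to `imageGCLowThresholdPercent` (80 by default). Below is a simplified model of that policy, not kubelet's actual code; the `Img` type is hypothetical, and `pinned` stands in for images the container runtime reports as pinned, which I believe recent kubelet versions skip.

```python
from dataclasses import dataclass


@dataclass
class Img:
    ref: str
    size: int             # bytes on disk
    last_used: float      # when a container last used this image
    in_use: bool = False  # referenced by a running container
    pinned: bool = False  # stands in for runtime-pinned images
    # Note: no label field. Labels play no part in this decision.


def select_for_gc(images, disk_used, disk_capacity, high_pct=85, low_pct=80):
    """Pick images to delete, least recently used first (simplified model)."""
    # Nothing happens until usage reaches the high threshold.
    if disk_used * 100 < high_pct * disk_capacity:
        return []
    # Then free enough space to get back down to the low threshold.
    to_free = disk_used - disk_capacity * low_pct / 100
    candidates = sorted(
        (i for i in images if not i.in_use and not i.pinned),
        key=lambda i: i.last_used,
    )
    victims = []
    for img in candidates:
        if to_free <= 0:
            break
        victims.append(img.ref)
        to_free -= img.size
    return victims
```

Under this model, labelling an image changes nothing; only keeping it in use or having the runtime pin it keeps it off the list.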
A bunch of issues like these, when our requirements are fairly straightforward. That's why we are working towards a stripped-down version.
There is a lot of operational complexity in general, and most of us can do without it.
How large is your operations team?
What you’re saying makes it sound like you’re a one-person operation, or somewhere close to that scale. That obviously doesn’t have the same requirements as much larger organizations.
I ran a job last night which provisioned a cluster with 4 TB of RAM and nearly 1000 vCPUs. It ran for 20 minutes, ingested about 800 GB of data from nearly a million files, and was then deleted. To do that on a single cluster that’s also used for serving production requests would be unnecessarily complex and risky. Our production system has users in every timezone using the system 24x7. At the very least you’d have to provision separate node pools anyway, but why would you bother to do that?
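Rough arithmetic on those figures gives a sense of the burst involved. This assumes 1 GB = 10^9 bytes and rounds "nearly a million" up to 10^6 files, so the per-file size is a slight underestimate. The rate is averaged over the whole 20-minute run, so the actual ingest rate was at least that high if provisioning took part of that time.

```latex
\frac{800\,\text{GB}}{20 \times 60\,\text{s}} = \frac{800\,\text{GB}}{1200\,\text{s}} \approx 0.67\,\text{GB/s},
\qquad
\frac{800\,\text{GB}}{10^{6}\,\text{files}} \approx 0.8\,\text{MB per file}
```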
In this case the urge to make a tweak is synonymous with the urge to make the product _function_.
I admire their dedication to the schtick, but the upshot is that since you cannot reach inside to make Talos actually work in environments that aren't supported by that golden path, running the product on many devices is "Talos Wrong".
That's their prerogative, but it's obnoxious in a "Windows 11 doesn't work on your perfectly functional laptop" kind of way.
I run Minikube in Podman for dev. And then I use kustomize to customize dev, staging and prod environments. The environments are 99% the same, they just have different env vars and memory limits.
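Here is a plain-Python analogue of that base-plus-overlay layout, purely to show the idea: one shared base, plus per-environment patches that carry only the differences (env vars, memory limits). The values are made up, and the merge is a simple recursive dict merge, not kustomize's strategic-merge semantics (which, for example, merge container lists by name).

```python
import copy


def deep_merge(base, overlay):
    """Recursively merge overlay into a copy of base; overlay values win."""
    out = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = copy.deepcopy(value)
    return out


# Shared by every environment (hypothetical values).
BASE = {
    "image": "app:latest",
    "env": {"LOG_LEVEL": "info"},
    "resources": {"limits": {"memory": "512Mi"}},
}

# Each overlay only states what differs from the base.
OVERLAYS = {
    "dev": {"env": {"LOG_LEVEL": "debug"}, "resources": {"limits": {"memory": "256Mi"}}},
    "staging": {"env": {"FEATURE_FLAGS": "beta"}},
    "prod": {"resources": {"limits": {"memory": "2Gi"}}},
}


def render(env):
    """Produce the effective config for one environment."""
    return deep_merge(BASE, OVERLAYS[env])
```

Keeping the overlays this small is what keeps the environments "99% the same": anything not mentioned in an overlay comes straight from the base.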
K3s runs on existing operating systems with batteries included.
I've found things to be more stable if you can dedicate an interface just to internal k3s communication. It can be a bridge interface on top of a VLAN interface, but not the VLAN interface itself, or some things will break in very interesting ways. Also, even when using IPv6, just stick with internal IPs and NAT everything; changing internal IP ranges later is no fun. Plus, if there's a chance you'd ever want to use dual-stack, set it up with internal v6 addresses now, and just don't use them yet. There's also a lot of unintuitive behaviour around dual-stack networking, and lots of areas where the documentation is just plain wrong.
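The addressing advice can be turned into a pre-flight check before bootstrapping a node. This hypothetical sketch uses Python's `ipaddress` module to require, per node, an RFC 1918 IPv4 address and a reserved IPv6 unique local address (`fc00::/7`, RFC 4193). The rules encode the advice above and are not something k3s itself enforces; hostnames and addresses are examples.

```python
import ipaddress

# RFC 1918 IPv4 ranges plus the IPv6 unique local range (RFC 4193).
INTERNAL_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")
]


def is_internal(addr):
    """True if addr falls in one of the internal ranges above."""
    ip = ipaddress.ip_address(addr)
    # Membership tests across IP versions simply return False.
    return any(ip in net for net in INTERNAL_NETS)


def preflight(node_ips):
    """node_ips maps hostname -> (ipv4, ipv6 or None); returns a list of problems."""
    problems = []
    for host, (v4, v6) in node_ips.items():
        if not is_internal(v4):
            problems.append(f"{host}: {v4} is not an RFC 1918 address")
        if v6 is None:
            problems.append(f"{host}: no internal IPv6 address reserved for dual-stack")
        elif not is_internal(v6):
            problems.append(f"{host}: {v6} is not a unique local (fc00::/7) address")
    return problems
```

An empty list from `preflight(...)` means every node has the internal v4 and reserved v6 addressing described above; anything else is worth fixing before the cluster exists.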
I'm scripting our setup with Ansible. One of the more useful realisations was that, in some areas, changes which shouldn't break anything can still interrupt cluster communication. That is a very interesting thing to deal with, especially when you can't pin it on the change, because that change didn't touch anything close to cluster communication and therefore shouldn't be responsible. I've learned my lesson and have now sprinkled checks throughout to make sure all members can still reach each other, so that at least when a change breaks things, I know immediately why.
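A reachability check like the ones described can be as small as a TCP connect to each peer. This sketch assumes the k3s server port 6443 and is meant to run on every node (e.g. from an Ansible task on each host), so that together the checks cover the full mesh. It only proves TCP reachability, so it won't catch problems with UDP overlay traffic such as flannel's VXLAN on 8472.

```python
import socket


def unreachable(peers, port=6443, timeout=2.0):
    """Return the peers that do not accept a TCP connection on `port`."""
    failed = []
    for host in peers:
        try:
            # A completed handshake is enough; close the socket right away.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:  # refused, timed out, no route, DNS failure, ...
            failed.append(host)
    return failed
```

Run after each change, a non-empty result points at the change that just went in, rather than leaving you to discover the breakage later.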
In an enterprise setup, when you are deploying at a customer site that is airgapped, with no access to the internet or repositories, you generally don't have control over the infrastructure. It can be as hostile as you can think of.
I cannot wait for the end of this month to leave that place.
We are hiring, btw.