Scaling Kubernetes to 2,500 Nodes (blog.openai.com)
I have never personally needed more than a few hundred Mesos agents, but these have been added without any noticeable impact on our extremely modestly provisioned (and multi-purpose) ZooKeeper cluster or any other components.
Has anyone used both systems and can speak to any advantages of k8s for these types of workloads?
Also, is anyone using some kind of torrent approach as a more reasonable way to avoid network bottlenecks when distributing big Docker images to a large number of nodes?
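To put rough numbers on that bottleneck, here is a minimal back-of-the-envelope sketch in Python. It uses the standard textbook lower bounds for client-server versus peer-to-peer file distribution, not anything measured in the article. The function names and every figure in the example (image size, node count, link speeds) are hypothetical assumptions chosen only for illustration.

```python
def client_server_time(size_gb, n_nodes, registry_gbps, node_down_gbps):
    """Lower bound in seconds when every node pulls the whole image from the registry.

    The registry must upload n_nodes full copies, and no node can finish
    faster than its own download link allows.
    """
    size_gbit = size_gb * 8  # gigabytes -> gigabits
    return max(n_nodes * size_gbit / registry_gbps, size_gbit / node_down_gbps)


def p2p_time(size_gb, n_nodes, registry_gbps, node_down_gbps, node_up_gbps):
    """Lower bound in seconds when nodes also re-upload pieces to each other.

    Idealized: assumes perfect piece scheduling and identical nodes.
    """
    size_gbit = size_gb * 8
    # Total upload capacity is the registry plus every participating node.
    total_upload_gbps = registry_gbps + n_nodes * node_up_gbps
    return max(
        size_gbit / registry_gbps,                # registry sends one copy at least once
        size_gbit / node_down_gbps,               # each node must download one full copy
        n_nodes * size_gbit / total_upload_gbps,  # all copies over all upload capacity
    )


if __name__ == "__main__":
    # Hypothetical numbers: 10 GB image, 2,500 nodes, 10 Gbit/s registry, 1 Gbit/s nodes.
    print(client_server_time(10, 2500, 10, 1))  # seconds, registry only
    print(p2p_time(10, 2500, 10, 1, 1))         # seconds, idealized swarm
```

With these assumed figures the registry-only bound is 20,000 seconds, while the idealized swarm bound is 80 seconds, set by each node's own download link. Real pulls also involve layer caching, decompression, and registry quotas, so treat this as an order-of-magnitude argument only.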
- disk latency
- monitoring queries
- homemade autoscaler killing all etcd nodes
- custom scheduling policy moving many kubedns processes to the same node
- unusually large docker images
- "sharing" gcr.io request quotas because of Azure NAT IPs
That's not to say that Mesos is not indeed scaling better or easier. I don't know enough about Mesos.
https://www.hashicorp.com/c1m.html
It was dead simple to install and use compared to my brief experience with k8s.
It covers everything from "What is Docker?" to learning how to apply it to your own projects. There's a tiny bit of theory, followed by lots of guided labs and examples.
In case you're curious, I've been using Docker in development and production since 2014 and am also a Docker Captain (TL;DR is Docker reached out to me to join their team as a trusted content provider).
[0] https://github.com/gravitational/workshop/blob/master/docker...
[1] https://github.com/gravitational/workshop/blob/master/k8s101...
We (Red Hat) make the following references (beyond what our training & docs provide) available to our consultants, customers, and world at-large. BTW, if something isn't clear, is wrong, or you want to discuss a point, reach out to us on GitHub. Just about all of our products, software, and documentation are up on GitHub.
I'd also recommend playing with Minishift or Minikube. Great way to put a quick sandbox on your laptop.
The source GitHub Repo: https://github.com/redhat-cop/openshift-playbooks
Building Blocks of OpenShift (& Kubernetes): http://v1.uncontained.io/playbooks/fundamentals/building_blo...
Docker Fundamentals Reference: http://v1.uncontained.io/playbooks/fundamentals/docker_refer...
Minishift: https://docs.openshift.org/latest/minishift/getting-started/...
Docker adds to this:
1. packaging that lets you define what goes into a container (Dockerfile) and a format (Docker image) that can be downloaded, extracted, manipulated and uploaded;
2. a way to stop (freeze) and start (thaw) a container;
3. tools for controlling network capabilities within a container, and between the host and the container or other containers on other hosts.
Those are the essentials.
Kubernetes (and other tools) expand on this in terms of orchestration -- especially the internetworking aspect but also failover and load balancing.
ARP caching seems to be a common issue in cloud environments. AWS recommends turning it off and does so itself in their Amazon Linux distro.
Check out https://github.com/google/kubeflow if you are interested in doing the same.
(Disclaimer: I work for GCP doing K8s stuff. I know GKE clusters support GPUs and Kubeflow, but I'm not 100% sure whether AKS supports it or whether you need to set up your own cluster like OpenAI did.)
If I want to train a TF model distributed over many machines in GCP, it seems like I could use Cloud ML Engine or deploy Kubeflow to a K8s cluster running in GKE and train it there.
What should I consider when choosing between these two options? Is there another option I should consider?
How do they recover it after a restart? I suppose it's not a manual process.
However, LRS storage does not save you if the whole datacenter goes down, or during scheduled maintenance (when the whole datacenter is down). For that, you'll need ZRS (which fails over to a co-lo if the primary datacenter goes down) or GRS (for which you can configure and test your failover options).
Also, Microsoft's strength is in their PaaS services, like App Service or Azure Functions. Those usually have Cosmos DB on the backend, which is pretty much the best failover/DB-availability server software on the market, in my opinion.
Btw, we are currently preparing an open-source release.
Disclaimer: I am co-founder at RiseML
ML Engine is an order of magnitude easier. You just have to do step 2 and set up your model to employ multiple GPUs.
https://docs.microsoft.com/en-us/azure/virtual-machines/wind...
I've found actual reboots to be rare - the exception being the recent Spectre / Meltdown patching.