Scaling Kubernetes to 2,500 Nodes

Scaling Kubernetes to 2,500 Nodes (blog.openai.com)

265 points by stanzheng 8 years ago | 33 comments

SEJeff 8 years ago |

This is a really fantastic set of general "how to tune kubernetes and the various components for large clusters". Thanks for writing this up!

drewrobb 8 years ago |

I'm surprised that the scaling story of k8s/(+etcd?) is still so far behind mesos/zk. There have been mesos clusters at over 10k Nodes for several years now.

I have never personally needed more than a few hundred mesos agents, but these have been added without any noticeable impact on our extremely modestly provisioned (and multi purpose) zk cluster or any other components.

Has anyone used both systems and can speak to any advantages of k8s for these types of workloads?

Also is anyone using some kind of torrent approach as a more reasonable solution to avoid network bottlenecks when distributing big docker images to a large number of nodes?

jo909 8 years ago | |

A lot of the issues were kind of "external" and while worth thinking about for every deployment, not really something the k8s project can do much about other than warn in the documentation.

  - disk latency
  - monitoring queries
  - homemade autoscaler killing all etcd nodes
  - custom scheduling policy moving many kubedns processes to the same node
  - unusually large docker images
  - "sharing" gcr.io request quotas because of Azure NAT IPs

That's not to say that Mesos is not indeed scaling better or easier. I don't know enough about Mesos.

merb 8 years ago |

What I find amazing about k8s is that it's one of the first solutions that is relatively simple for a small cluster (HA, while scheduling stuff on the masters) but can scale amazingly well even for a big cluster. You can start with 3 nodes with about 8 GB per machine (or less; I guess even 2 GB is feasible if you only want to use about 1-1.5 GB of memory per machine). Non-HA can of course be smaller.

scarface74 8 years ago | |

The Nomad executable is self contained and less than 15MB if I remember correctly. It can be used with Docker containers, shell scripts, or just raw executables.

https://www.hashicorp.com/c1m.html

It was dead simple to install and use compared to my brief experience with k8s.

roscoebeezie 8 years ago |

As a person who doesn’t understand containers, where do I go to learn the basics?

nickjj 8 years ago | |

If you want to avoid doing a bunch of research on your own, I've put together an up-to-date, self-paced video course at https://diveintodocker.com/.

It covers everything from "What is Docker?" to learning how to apply it to your own projects. There's a tiny bit of theory, followed by lots of guided labs and examples.

In case you're curious, I've been using Docker in development and production since 2014 and am also a Docker Captain (TL;DR is Docker reached out to me to join their team as a trusted content provider).

twakefield 8 years ago | |

We have Docker 101 [0] and Kubernetes 101 [1] tutorials on our GitHub repo which may be helpful, assuming you have an understanding of Linux. Or check out the resources on Docker's website [2].

[0] https://github.com/gravitational/workshop/blob/master/docker...

[1] https://github.com/gravitational/workshop/blob/master/k8s101...

[2] https://www.docker.com/what-docker

oso2k 8 years ago | |

DISCLAIMER: I'm a Consulting Architect at Red Hat in our Container and PaaS Practice Group. An expert, knowledge building, & community building/supporting group. A Community of Practice, we call it.

We (Red Hat) make the following references (beyond what our training & docs provide) available to our consultants, customers, and world at-large. BTW, if something isn't clear, is wrong, or you want to discuss a point, reach out to us on GitHub. Just about all of our products, software, and documentation are up on GitHub.

I'd also recommend playing with Minishift or Minikube. Great way to put a quick sandbox on your laptop.

The source GitHub Repo: https://github.com/redhat-cop/openshift-playbooks

Building Blocks of OpenShift (& Kubernetes): http://v1.uncontained.io/playbooks/fundamentals/building_blo...

Docker Fundamentals Reference: http://v1.uncontained.io/playbooks/fundamentals/docker_refer...

Minishift: https://docs.openshift.org/latest/minishift/getting-started/...

emmelaich 8 years ago | |

The defining characteristics of containers are constrained processes (constrained by one or more of memory, CPU, access controls, ...); isolated namespaces (your own tmp, root, users, process list); plus some (possibly limited) additional isolation from the rest of the host's files and hardware.

Docker adds to this 1. a packaging that lets you define what goes into a container (Dockerfile) and format (docker image) - which can be downloaded, extracted, manipulated and uploaded. 2. a way to stop (freeze) and start (thaw) a container 3. tools for controlling network capabilities within a container and between the host and container or other containers on other hosts.

Those are the essentials.

Kubernetes (and other tools) expand on this in terms of orchestration -- especially the internetworking aspect but also failover and load balancing.

frakkingcylons 8 years ago | |

Docker has a good overview in my opinion:

https://www.docker.com/what-container

btmiller 8 years ago | |

At the fundamental layer, start with cgroups and namespaces. If you're looking for practical knowledge, start with making and running a Docker image - usually a basic "hello world" web server (nginx) is a good one to start on :)
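To make "cgroups and namespaces" a bit more concrete, here is a minimal Python sketch that inspects which namespaces and cgroups a Linux process belongs to. It only reads the standard `/proc/<pid>/ns` and `/proc/<pid>/cgroup` entries; the function names `list_namespaces` and `read_cgroups` are made up for this illustration and are not part of Docker or any library.

```python
import os
from pathlib import Path


def list_namespaces(proc_dir="/proc/self"):
    """Map namespace name -> identifier for one process.

    Each entry under <proc_dir>/ns is a symlink whose target looks like
    'pid:[4026531836]'. Two processes that show the same identifier
    share that namespace.
    """
    ns_dir = Path(proc_dir) / "ns"
    namespaces = {}
    for entry in sorted(ns_dir.iterdir()):
        try:
            namespaces[entry.name] = os.readlink(entry)
        except OSError:
            # Reading another user's process may be denied.
            namespaces[entry.name] = "<unreadable>"
    return namespaces


def read_cgroups(proc_dir="/proc/self"):
    """Return the lines of <proc_dir>/cgroup.

    Each line has the form 'hierarchy-id:controllers:path'.
    """
    return (Path(proc_dir) / "cgroup").read_text().splitlines()
```

Calling `list_namespaces()` once on the host and once inside a container should show different identifiers for most namespaces, which is roughly what "isolated" means here. Passing `"/proc/<pid>"` instead of the default lets you compare two processes.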

krat0sprakhar 8 years ago | |

Here's a Docker tutorial I wrote targeted at beginners: https://docker-curriculum.com/

DannyB2 8 years ago | |

Once you understand the basics of Docker, which is containers, then look at Kubernetes. One place you might start is with spinning up a single VM with RancherOS.

djb_hackernews 8 years ago |

350TB of memory, and 50,000 cores, nice.

ARP caching seems to be a common issue in cloud environments. AWS recommends turning it off and does so itself in their Amazon Linux distro.

myrandomcomment 8 years ago |

Ran into the ARP scale issues when trying to put 1000 containers on a system for scale testing over a year ago. strace helped figure out where the issue was and what settings to change. I guess I should have sent an email to the mailing list. At that time, searching for how to scale to 1000 Docker containers got you nowhere, as the results were all "hey, here is how I scaled to 1000 containers over X number of nodes". No one was crazy enough to try to get 1000 on a single machine.
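For anyone wanting to check whether a host is near its ARP limits, here is a rough Python sketch. The comment above does not say which settings were changed; this assumes the usual suspects, the kernel's IPv4 neighbor table thresholds (`gc_thresh1`, `gc_thresh2`, `gc_thresh3` under `/proc/sys/net/ipv4/neigh/default`), and compares them with the number of entries in `/proc/net/arp`. The helper names are invented for this example.

```python
from pathlib import Path

THRESHOLD_NAMES = ("gc_thresh1", "gc_thresh2", "gc_thresh3")


def read_gc_thresholds(neigh_dir="/proc/sys/net/ipv4/neigh/default"):
    """Read the three neighbor-table GC thresholds as integers."""
    base = Path(neigh_dir)
    return {name: int((base / name).read_text().strip())
            for name in THRESHOLD_NAMES}


def count_arp_entries(arp_table="/proc/net/arp"):
    """Count entries in the IPv4 ARP table.

    The first line of the file is a column header and is skipped.
    Only the current network namespace is visible through this file.
    """
    lines = Path(arp_table).read_text().splitlines()
    return sum(1 for line in lines[1:] if line.strip())


def arp_table_usage(entries, thresholds):
    """Fraction of the hard limit (gc_thresh3) currently in use."""
    return entries / thresholds["gc_thresh3"]
```

If `arp_table_usage(count_arp_entries(), read_gc_thresholds())` gets close to 1.0, the kernel is about to start dropping neighbor entries, and raising the thresholds via sysctl is the usual remedy. Treat this as a diagnostic sketch, not a tuning recommendation.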

eggie5 8 years ago |

Does OpenAI train w/ GPUs on k8s clusters?

thesandlord 8 years ago | |

According to the article, they are using NC24 VMs, which have 4 K80s attached. So yes, I would assume they are using GPUs.

Check out https://github.com/google/kubeflow if you are interested in doing the same.

(Disclaimer: I work for GCP doing K8s stuff, I know GKE clusters support GPUs and Kubeflow, not 100% sure if AKS supports it or if you need to set up your own cluster like OpenAI did.)

frakkingcylons 8 years ago | | |

I have a somewhat off-topic question as a complete TensorFlow beginner and it seems like you'd be in the know:

If I want to train a TF model distributed over many machines in GCP, it seems like I could use Cloud ML Engine or deploy Kubeflow to a K8s cluster running in GKE and train it there.

What should I consider when choosing between these two options? Is there another option I should consider?

raccoonone 8 years ago | |

Yep, we've been using GPUs for quite a while (even before the alpha support in Kube), both the K80s in Azure and some Pascals in our own clusters. With the support in Kube now it's quite seamless.
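As a rough illustration of what requesting a GPU in Kubernetes can look like, here is a small Python sketch that builds a Pod manifest. It is not taken from the article or from the setup described above. The resource name `nvidia.com/gpu` is the one exposed by the NVIDIA device plugin and is an assumption here (older alpha releases used a different name), and `gpu_pod_manifest` plus the pod and image names in the usage note are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, gpu_resource="nvidia.com/gpu"):
    """Build a minimal Pod manifest that asks the scheduler for GPUs.

    gpu_resource is an assumption: it must match whatever extended
    resource name the cluster's device plugin advertises.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs go under limits.
                    "resources": {"limits": {gpu_resource: gpus}},
                }
            ],
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

Something like `print(to_json(gpu_pod_manifest("train-job", "example.com/trainer:latest", gpus=4)))` produces a document that could be piped to `kubectl apply -f -`, assuming the cluster has GPU nodes and a device plugin installed.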

eggie5 8 years ago | | |

That's some groundbreaking work: GPUs in K8s. By Kube you mean the KubeFlow project?

EDevil 8 years ago |

Isn’t it a problem to have etcd store its state on a non persistent volume?

How do they recover it after a restart? I suppose it's not a manual process.

justingood 8 years ago | |

The replacement machine will start pulling its data from the remaining nodes when it joins the cluster. However, it's recommended to migrate the failed node's data first if it's greater than 50MB: https://github.com/coreos/etcd/blob/master/Documentation/op-...

bdburns 8 years ago |

(Azure containers lead here) Awesome to see OpenAI scale Kubernetes on Azure!

Findeton 8 years ago | |

(in Azure) Are VMs still tied to a specific machine and thus if the machine goes down the VMs need be restarted?

rotrux 8 years ago | | |

No. I work exclusively with scaling Linux workloads in Azure and local failover happens fairly regularly without the user ever seeing any indication. The hypervisor has its own cache which is tied more to your storage acct & runtime data-disks than the VM itself. This is true even with the lowest cost storage option: LRS standard blob-storage.

Howwwwever, LRS storage does not save you if the whole datacenter goes down, or during scheduled maintenance (when the whole datacenter is down.) For that, you'll need ZRS (which does failover to a co-lo in the event of the primary datacenter going down) or GRS (for which you can configure/test your failover options.)

Also, Microsoft's strength is in their PaaS services, like app-service or Azure-Functions. Those usually have CosmosDB on the backend, which is pretty much the best failover/DB-availability server-software on the market in my opinion.

bdburns 8 years ago | | |

VMs in all clouds are always tied to specific machines. If that machine fails unexpectedly then those VMs will restart. If it is a controlled reboot (e.g. host update) then they may not restart...
