IPvlan overlay-free Kubernetes Networking in AWS

IPvlan overlay-free Kubernetes Networking in AWS (eng.lyft.com)

144 points by theatrus2 8 years ago | 56 comments

ggm 8 years ago |

I know this invites some eye-rolling, but can somebody explain to me why the k8s people insist on ignoring IPv6 and the possibilities of large address fields?

IPv6 only comes in the door down at the bottom, which is where the 'things we probably will never do' go.

Azure (for instance) is a fully IPv6-enabled fabric. Microsoft "get" IPv6. They are all over it. They understand it, it's baked into the DNA. So how come K8s people just kind of think "yea.. nah.. not right now"?

Because proxying IPv6 at the edge is really sucky. We should be using native IPv6, preserving e2e under whatever routing model we need for reliability, and gatewaying the V4 through proxies in the longer term.

(serious Q btw)

andrewstuart2 8 years ago | |

They're not ignoring it. It's being actively worked on, and is expected to be in alpha for the 1.9 release [1].

The issue [2] has existed for over 3 years, so it's not a new suggestion.

[1] https://github.com/kubernetes/features/issues/508

[2] https://github.com/kubernetes/kubernetes/issues/1443

jamiesonbecker 8 years ago | | |

> "not ignoring it" .. "actively worked on" .. "for over 3 years" ..

Not to diminish the very real challenges in getting IPv6 implemented, but this is an interesting turn of phrase.. especially because rolling out IPv6 would actually solve a whole class of problems (and I'm not even a particularly big advocate of the need for IPv6, since most things should still be NATed anyway.)

(And especially considering parent's phrase "baked into the DNA" at Azure.)

sargun 8 years ago | |

Independently from this, IPv6 doesn't really work in AWS VPC. For example, ELB / ALB breaks under IPv6 endpoints. A lot of VPC services aren't available on IPv6. Metadata service has no IPv6 equivalent. I'm sure someone at AWS is thinking of these problems, but unfortunately, there are turtles all the way down.

fulafel 8 years ago | |

Is the lack of e2e IP networking a chicken and egg problem? The built-in assumption of NAT islands w/ ambiguous addresses seems to be widespread in container and virtualization platforms, with little support for Internet style networking despite the obvious security and simplicity advantages.

I guess even today many people have problems getting more than a /64 in the office or home network (edit: it's supported usually with the prefix delegation option in DHCPv6 by most ISPs), so it's not frictionless in the dev environment.

puzzle 8 years ago | |

Part of it is that Google took forever to migrate to IPv6 internally, well after the user facing support.

ggm 8 years ago | | |

Yes. We jumped ship from self-deployed kubes on Linode to Google Cloud once they provided an external V6 face in the LB. The v6 story in Google is complicated. (We jumped to Google because the integration of their tools and kubectl was too good to ignore. Most things just work. Alas, IPv6 inside the pod ecosystem is not one of them yet.)

glenjamin 8 years ago | |

Do you have any resources that expand on the IPv6 support in Azure?

Everything I've seen in their networking configuration screens and APIs appears to only allow IPv4 addresses.

ggm 8 years ago | | |

> Do you have any resources that expand on the IPv6 support in Azure?

Alas no. When I looked at k8s/Azure the IPv6 support was new.

My comment about IPv6 'baked into the DNA' of Microsoft is about Microsoft, not Azure. A lot of the work on privacy addresses, the deployment of Teredo, and the adoption of ULA addresses comes from people inside Microsoft, and they have been presenting recently at NANOG and IETF on IPv6-only deployments on the Redmond campus.

deepakjois 8 years ago |

Not directly related, but can someone recommend a beginners resource to understand Kubernetes networking? There are some good ones out there that explain basic Kubernetes concepts like pods, replicas etc. But networking seems to be a more complicated topic, and most intro guides skip over it.

muxator 8 years ago |

For those wondering what's the difference between macvlan and ipvlan, the main ipvlan paper [0] summarizes its raison d'être:

> This is especially problematic where the connected next-hop e.g. switch is expecting frames from a specific mac from a specific port.

e.g.: if the host is attached to a managed switch with a strict security policy, macvlan would not work.

[0] https://www.netdevconf.org/0.1/sessions/28.html

KaiserPro 8 years ago | |

This is what I'm trying to understand. Macvlan appears to be a much better solution as it allows 1-1 mapping and piggybacking onto all the automatic/set&forget mechanisms that AWS provides.

Obviously it needs a switch at the other side that can handle a huge and quickly changing ARP table. Also, if you have MAC address limiting, typical on edge switches, it's a non-flyer.

puzzle 8 years ago | |

If you need distinct MAC addresses, though, for e.g. DHCP, usually not the case with containers, then you have to use macvlan.

SEJeff 8 years ago |

Just wanted to give a shout out to kube-router[1], a really fantastic solution if you want to use BGP, that will soon support running without BGP by implementing a feature set similar to flannel's host-gw support. They are really good about addressing things in the open on their GitHub[2]. BGP is, by definition, "web scale" as it runs most routing for the internet. Lower latency and much higher throughput than any sort of overlay network.

[1] https://www.kube-router.io

[2] https://github.com/cloudnativelabs/kube-router

hueving 8 years ago | |

Saying BGP is "web scale" is a bit misleading because it has to be very carefully aggregated for it to route the entire internet.

If you do something like advertise a /32 for each container you can very quickly fill up TCAMs on your network hardware (in particular cheap top of rack switches that are pervasive in data centers).

The entire v4 internet is something like 600k prefixes right now and the routers that can handle that many prefixes at line rate are irritatingly expensive. ToRs as of a couple of years ago when I last tested this would fall over at 1-10k prefixes.

So be careful when looking at BGP solutions because it's very easy to have a BGP topology that doesn't scale, despite it being the exchange protocol for the Internet.

jlgaddis 8 years ago | | |

In addition to what SEJeff said, as long as you design your IP addressing correctly you'll be fine. By that, I mean hierarchically. Just divide up whatever IP network you're using (e.g. 10/8) and make sure you allocate "enough" to each rack/whatever.

Assuming everything is nice and hierarchical, you can easily aggregate an entire rack to a single prefix. Even the shitty ToR switches can usually handle a couple thousand prefixes, which should be plenty if done correctly.

Obviously you shouldn't be advertising /32s.
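To make the aggregation point concrete, here is a minimal sketch using Python's standard `ipaddress` module. The /20-per-rack split, the 500 containers per rack, and the helper names `host_routes` and `advertised` are illustrative assumptions, not anything taken from the article or from kube-router.

```python
import ipaddress
from itertools import islice

# Assumption: the cluster lives in 10.0.0.0/8 and every rack is handed one /20.
cluster = ipaddress.ip_network("10.0.0.0/8")
rack_iter = cluster.subnets(new_prefix=20)
rack0 = next(rack_iter)  # 10.0.0.0/20
rack1 = next(rack_iter)  # 10.0.16.0/20


def host_routes(rack, count):
    """Return `count` /32 routes, one per container address inside `rack`."""
    return [ipaddress.ip_network(f"{addr}/32") for addr in islice(rack.hosts(), count)]


def advertised(routes, rack_prefixes):
    """Collapse routes to the rack prefixes that cover them (what a ToR would announce)."""
    return sorted({rack for r in routes for rack in rack_prefixes if r.subnet_of(rack)})


# Naive design: one /32 per container, 500 containers in each of two racks.
per_container = host_routes(rack0, 500) + host_routes(rack1, 500)

# Hierarchical design: each rack is announced as a single prefix.
per_rack = advertised(per_container, [rack0, rack1])

print(len(per_container), "routes without aggregation")
print(len(per_rack), "routes with per-rack aggregation:", [str(p) for p in per_rack])
```

The table size then grows with the number of racks rather than the number of containers, which is the difference between staying under a ToR's prefix limit and blowing through it.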

> The entire v4 internet is something like 600k prefixes right now ...

Just checked my edge routers and it looks like we're up to ~671k prefixes here, and that number is still increasing every day.

paxy 8 years ago |

> Announcing cni-ipvlan-vpc-k8s

Rolls right off the tongue, doesn't it?

andrewstuart2 8 years ago | |

CIVK, pronounced civic?

frogperson 8 years ago | | |

Are we doing acronyms of acronyms now?

chris_marino 8 years ago |

It's all about trade-offs. We've built a CNI for k8s and have looked into all of the techniques described. It seems that Lyft's design is a direct reflection of their requirements.

To the extent your requirements match theirs, this could be a good alternative. The most significant in my mind is that it's meant to be used in conjunction with Envoy. Envoy itself has its own set of design trade-offs as well.

For example, Lyft currently uses 'service-assigned EC2 instances'. Not hard to see how this starting point would influence the design. The Envoy/Istio model of proxy per pod also reflects this kind of workload partitioning. Obviously, a design for a small number of pods (each with their own proxy) per instance is going to be very different from one that needs to handle 100 pods (and their IPs), or more, per instance.

Another is that k8s network policy can't be applied since the 'Kubernetes Services see connections from a node’s source IP instead of the Pod’s source IP'. But I don't think this CNI is intended to work with any other network policy API enforcement mechanism. Romana (the project I work on) and the other CNI providers that use iptables to enforce network policy rely on seeing the pod's source IP.

Again, this might be fine if you're running Envoy. On the other hand, L3 filtering on the host might be important.

Also, this design requires that 'CNI plugins communicate with AWS networking APIs to provision network resources for Pods'. This may or may not be something you want your instances to do.

FWIW, Romana lets you build clusters larger than 50 nodes without an overlay or more 'exotic networking techniques' or 'massive' complexity. It does it via simple route aggregation, completely standard networking.

warp_factor 8 years ago | |

Not all NetworkPolicy implementations base themselves on Source/Destination IPs. I can think specifically of Trireme/Cilium that are using metadata in order to enable policies.

chris_marino 8 years ago | | |

I knew that. What I didn't know was whether either of these could apply network policy to these endpoints. Guessing that since they each require their own CNI, there will be probs. So, whether the CNI uses iptables or not, it's not clear how the network policy API can be enforced.

bogomipz 8 years ago |

The author states:

>"Unfortunately, AWS’s VPC product has a default maximum of 50 non-propagated routes per route table, which can be increased up to a hard limit of 100 routes at the cost of potentially reducing network performance."

Could someone explain why increasing from 50 to 100 non-propagated routes in a VPC results in network performance degradation?

netingle 8 years ago |

IIUC ENIs are limited to 2 per host on small instances, 15 per host on larger ones. Doesn't this approach limit the number of Pods per host? I'm already running about 20 pods per host, and I don't think more containers per host is atypical.
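A rough way to reason about this: since the article's approach hands Pods secondary private IPs on ENIs, the ceiling is closer to ENIs times IPs per ENI than to the ENI count alone. The sketch below is a back-of-the-envelope estimate only. The function name `max_pod_ips`, the choice to reserve one ENI for the host, the assumption that each ENI's primary IP is not given to a Pod, and the per-ENI IP counts in the examples are all hypothetical; only the ENI counts of 2 and 15 come from the comment above, and real limits vary by instance type.

```python
def max_pod_ips(enis: int, ips_per_eni: int, reserved_enis: int = 1) -> int:
    """Estimate how many Pod IPs a host can offer via secondary private IPs.

    Assumptions (hypothetical, for illustration):
      * `reserved_enis` interfaces are kept for the host itself,
      * the primary IP on every remaining ENI is not assigned to a Pod.
    """
    usable_enis = max(enis - reserved_enis, 0)
    secondary_ips = max(ips_per_eni - 1, 0)
    return usable_enis * secondary_ips


# Small instance: 2 ENIs, assuming 10 addresses per ENI.
small = max_pod_ips(enis=2, ips_per_eni=10)

# Larger instance: 15 ENIs, assuming 50 addresses per ENI.
large = max_pod_ips(enis=15, ips_per_eni=50)

print("small instance Pod IP estimate:", small)
print("large instance Pod IP estimate:", large)
```

Under these assumptions a small instance would indeed fall short of 20 Pods, while larger instances have far more headroom than the raw ENI count suggests.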

tamalsaha001 8 years ago |

How does it compare to AWS' own CNI plugin? https://github.com/aws/amazon-vpc-cni-k8s

lambda 8 years ago | |

If you read the article, you'll see:

> Lincoln Stoll’s k8s-vpcnet, and more recently, Amazon’s amazon-vpc-cni-k8s CNI stacks use Elastic Network Interfaces (ENIs) and secondary private IPs to achieve an overlay-free AWS VPC-native solutions for Kubernetes networking. While both of these solutions achieve the same base goal of drastically simplifying the network complexity of deploying Kubernetes at scale on AWS, they do not focus on minimizing network latency and kernel overhead as part of implementing a compliant networking stack.

bogomipz 8 years ago | | |

But if you follow the link under the "components" section of your link, you will see the link to the project proposal:

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/propos...

There they clearly state:

>"To run Kubernetes over AWS VPC, we would like to reach following additional goals:

Networking for Pods must support high throughput and availability, low latency and minimal jitter comparable to the characteristics a user would get from EC2 networking"

scurvy 8 years ago | | |

How does it compare with Romana? They added a VPC router specifically for large K8 clusters on AWS.

https://github.com/romana/vpc-router