eBPF Is Awesome

268 points by filipn 5 years ago | 44 comments

tptacek 5 years ago |

Most examples of BPF code are written in a mix of Python and C using BCC, the "BPF Compiler Collection", which essentially treats all of LLVM and clang as a library callable from Python code.

I can't get my head around using it that way, and have found it pretty straightforward to just write C programs, compiled with clang `-target bpf`. Until very recently, writing anything interesting this way required you to declare all functions inline, compile into a single ELF .o, and, of course, avoid most loops. But most of the kinds of things you'd write in BPF tend not to be especially loopy (you can factor most algorithmic code out into userland, communicating with BPF using maps).

A big issue for this kind of development is kernel compat; struct layouts can change from release to release, for instance. This isn't a problem for us at Fly, because we just run the same kernel everywhere, but it's a real problem if you're trying to ship a tool for other people's systems. But that's changing with CO-RE; recent kernels can export a simplified symbol table in a BPF-legible format called BTF, and the loader can perform relocations. Facebook has written a bunch of good stuff about this:

https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc...
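
To make the relocation idea concrete, here is a minimal Python sketch. It is not how libbpf implements CO-RE. The two `ctypes` structs stand in for the same kernel struct on two hypothetical releases, and `field_offset` plays the role of a loader resolving a field by name from type information instead of trusting an offset baked in at compile time. All names (`TaskV1`, `TaskV2`, `field_offset`, `read_pid`) are made up for illustration.

```python
import ctypes


class TaskV1(ctypes.Structure):
    # Hypothetical layout on an "old" kernel: pid is the first field.
    _fields_ = [("pid", ctypes.c_uint32), ("flags", ctypes.c_uint32)]


class TaskV2(ctypes.Structure):
    # Hypothetical layout on a "new" kernel: a field was added before pid.
    _fields_ = [
        ("state", ctypes.c_uint64),
        ("pid", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
    ]


def field_offset(layout, name):
    """Look up a field's byte offset by name from the layout's type info."""
    return getattr(layout, name).offset


def read_pid(raw, layout):
    """Read pid from raw struct bytes using an offset resolved at load time."""
    off = field_offset(layout, "pid")
    return ctypes.c_uint32.from_buffer_copy(raw, off).value


# The same "program" (read_pid) works against both layouts unchanged.
old = bytes(TaskV1(pid=42, flags=0))
new = bytes(TaskV2(state=1, pid=42, flags=0))
```

The point of the design is that the reader never hardcodes an offset; it only names the field, and the layout of the running kernel decides where that field lives.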

lights0123 5 years ago | |

There's also https://github.com/alessandrod/bpf-linker to make compiling a bit easier, as it does necessary inlining at link-time.

chubot 5 years ago | |

I think dtrace has the same problem, i.e. it's pretty tightly coupled to the exact functions / trace points in the kernel. A different kernel can break a dtrace script, although I think their code changes a lot less than Linux does.

It seems somewhat unavoidable, if the goal is to introspect the kernel at a very intimate level ...

bcantrill 5 years ago | | |

No, DTrace does not have this problem, though our solution to it is one of the least well known aspects of DTrace: we have a notion of explicit stability that allows for stable scripts to be built on top of very low level implementation details that themselves might change. See the chapter on "Stability" in the Dynamic Tracing Guide[1] for details.

[1] http://dtrace.org/guide/chp-stab.html

tptacek 5 years ago | | |

BTF is designed to avoid the problem, by having the live kernel export symbols; supposedly --- I haven't used it --- the toolchain even converts it to a header file, so BPF programs just include a "vmlinux.h" instead of includes pointing into kernel source (which is a nightmare). It's ambitious and I'm surprised it does as much as it does but apparently they're solving this problem.

xyzzy_plugh 5 years ago | |

What's worse is that I've run into kernel bugs/panics a few times that made me hesitate to recommend BPF for production systems. Hopefully those become less frequent as the ecosystem matures, but they were pretty scary!

bostonsre 5 years ago | |

Handling input options and displaying output is a little easier in Python. It also lets you hack on the tools quickly and run any changes instantly.

bboreham 5 years ago | |

CO-RE is great, but for those who have to run on older kernels an approach is to loop, guessing the offset and running an experiment to see if correct:

https://github.com/weaveworks/tcptracer-bpf/blob/cd53e7c84ba...

This was done (by Kinvolk) for the visualisation tool Weave Scope; also picked up by DataDog https://github.com/DataDog/datadog-process-agent/tree/master...
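
The guess-and-check loop can be sketched in a few lines of Python. This is only a toy model of the idea, not the tcptracer-bpf code: `find_offset`, `KNOWN_PORT` and the fake struct are hypothetical, and the real approach runs its experiment against live kernel data rather than a byte string.

```python
import struct


def find_offset(blob, expected, fmt="<H"):
    """Return the first offset where decoding blob with fmt gives expected, else None."""
    size = struct.calcsize(fmt)
    for off in range(len(blob) - size + 1):
        (value,) = struct.unpack_from(fmt, blob, off)
        if value == expected:
            return off
    return None


# Experiment: we opened a connection to a port we picked ourselves, so we know
# what value the port field must hold somewhere inside this fake struct.
KNOWN_PORT = 0xBEEF
fake_sock = bytes(12) + struct.pack("<H", KNOWN_PORT) + bytes(18)
```

Here `find_offset(fake_sock, KNOWN_PORT)` walks the bytes until the decoded value matches the one planted by the experiment. Picking a distinctive value matters, since a common value such as zero would match at the wrong offset.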

noisy_boy 5 years ago | |

I got a bunch of Numba version-related errors (Python 3.7) when I tried to run the example code on the website, and my thoughts were in the same direction. I was wondering if it is possible to write something like this in, say, Golang instead of Python.

seneca 5 years ago | | |

There are Go bindings for BPF and BCC: https://github.com/iovisor/gobpf

I'm not sure of the state of them at this point, but it's the same paradigm GP mentioned.

mhh__ 5 years ago |

Spectre mitigations can make it go from awesome to useful.

The documentation is also pretty dire, but it's mostly implement-once remember-forever in my experience - it's all there but kernel samples are quite hard to read, and I'd rather not guess based on struct listings (e.g. variable length structs aren't particularly fun when you're fumbling around)

ncmncm 5 years ago |

It seems worth mentioning that the code actually executing in the kernel, when it is running your eBPF, is native machine code, ahead-of-time compiled from the bytecode program you gave to the kernel.

0b01 5 years ago | |

Technically it's JITted not AOT compiled.

ithkuil 5 years ago | | |

The line is blurry.

JIT compilation means compiling code on the fly right when you're about to execute it. When the code is part of a larger program, JITting lets you compile only the parts you actually need and avoid wasting time compiling the rest. It also lets you compile the relevant code paths based on dynamic flow analysis, which often involves interpreting your program the first time you run it and emitting instructions for the next time around (tracing).

If the code unit is small and you know you're going to run it all, you can compile it in one swoop. If you compile it when you load it into the kernel, as opposed to compiling it lazily right before you run it the first time, I think it's fair to call this Ahead-of-Time compilation, even if the compilation happens right next to the use site and not as part of the developer toolchain.
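
The distinction can be shown with Python's built-in `compile()`, as a loose analogy only; the kernel's BPF JIT works nothing like this. `LoadTimeProgram` translates its source once when it is loaded, before anything runs it, while `LazyProgram` defers the work until the first call. Both class names are invented for this sketch.

```python
class LoadTimeProgram:
    """Compiled as soon as it is loaded, before anything runs it."""

    def __init__(self, source):
        self.code = compile(source, "<prog>", "eval")

    def run(self, env):
        return eval(self.code, {}, dict(env))


class LazyProgram:
    """Compiled on first use, the way a classic JIT would do it."""

    def __init__(self, source):
        self.source = source
        self.code = None  # nothing compiled yet

    def run(self, env):
        if self.code is None:
            self.code = compile(self.source, "<prog>", "eval")
        return eval(self.code, {}, dict(env))


eager = LoadTimeProgram("pkt_len > 64")
lazy = LazyProgram("pkt_len > 64")
```

After loading, `eager.code` already holds a code object while `lazy.code` is still `None` until `lazy.run(...)` is called once.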

thomashabets2 5 years ago |

Yup, it is. My recent epiphany-blogpost:

https://blog.habets.se/2020/11/BPF-the-future-of-configs.htm...

buckminster 5 years ago |

> running a user space program inside the kernel

Isn't it actually running a user program in kernel space?

perlgeek 5 years ago |

Is it possible to write device drivers in eBPF?

(I've asked this before, but haven't gotten any response, and no clear answer from Google/DDG either).

mhh__ 5 years ago | |

eBPF isn't Turing complete after being verified so I would assume no.
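
One way to see why this follows from verification: a checker that rejects every backward jump only accepts programs that are guaranteed to terminate. The sketch below uses a made-up three-instruction format, not real BPF bytecode, and checks far less than the kernel verifier does.

```python
def verify(program):
    """Accept a program only if every jump is in bounds and strictly forward.

    Each instruction is a tuple: ("op",), ("jmp", offset) or ("exit",).
    A jump at index i lands on i + 1 + offset, so offset >= 0 means forward.
    """
    if not program or program[-1] != ("exit",):
        return False  # must end in exit so execution cannot fall off the end
    for i, insn in enumerate(program):
        if insn[0] == "jmp":
            target = i + 1 + insn[1]
            if insn[1] < 0 or target >= len(program):
                return False  # back-edge (possible loop) or out-of-bounds jump
    return True


straight_line = [("op",), ("jmp", 1), ("op",), ("exit",)]
looping = [("op",), ("jmp", -2), ("exit",)]
```

With only forward jumps the program counter strictly increases, so any accepted program finishes in at most `len(program)` steps. That is exactly the property a Turing-complete language cannot promise.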

perlgeek 5 years ago | | |

Do device drivers typically need to be Turing complete? I would have expected drivers for simple USB devices, for example, to be pretty simple state machines.

polskibus 5 years ago |

What are the benefits of using eBPF besides a promise of observability "for free"?

Can eBPF be used for observability on platforms like Java or .NET Core, or do their platform VMs obfuscate so much that monitoring them with eBPF is not feasible?

How does eBPF work wrt OpenTelemetry etc.? Should OpenTelemetry be seen as standardized interfaces to which eBPF reports data?

nvarsj 5 years ago | |

eBPF helps with kernel observability - an area that has been sorely lacking in the past. For the JVM or .NET, they give you virtually no insight at all into system calls - so eBPF is complementary to VM profilers, not a replacement. If you ever used Shark on OS X you will get a sense of how cool this is - this was a profiler for the OS X JVM which profiled the system calls as well and combined it all into a single trace tree. Maybe one day we'll get similar profilers on Linux for these systems - with eBPF it should be fairly straightforward.

OpenTelemetry is just a reference API. You could export metrics using eBPF as well. I'm pretty sure Sysdig does this for example.

WatchDog 5 years ago | | |

See Brendan Gregg's excellent work in this space

http://www.brendangregg.com/blog/2014-06-12/java-flame-graph...

javierhonduco 5 years ago | |

It’s definitely possible in some VMs. I’ve been working on a Ruby profiler that collects the stacks from a BPF program [1]. There are some BPF safety mechanisms that require some creativity to overcome, such as max instructions, not being Turing complete, etc.

[1]: https://github.com/facebookexperimental/rbperf

knorker 5 years ago |

> The eBPF program is written in a pseudo-C code

Pseudo? This is a nit, but isn't it actually regular C?

waynesonfire 5 years ago |

I just happened to run into a FreeBSD video on dtrace (similar technology to eBPF, I think) that was created three weeks ago.

https://www.youtube.com/watch?v=E06GVdH-LX0

knorker 5 years ago | |

I think comparing dtrace to eBPF is missing what makes eBPF great. dtrace is just one application that can be implemented using eBPF.

Your toolbox can be used to fix things, but eBPF is a factory for making new types of tools and toolboxes.

eBPF can be used to make small programs that run at tracing points, thus making dtrace. But it can also be used to make packet filter decisions (thus altering what happens), and with at least one network card that eBPF program can be pushed to the network card and filter before the packet even hits RAM, much less the CPU!

eBPF can run at socket init time, and set some default TCP tuning parameters.

Another comment in this thread asked if one can write a whole device driver in eBPF. The answer is actually not clear.

eBPF is more similar to "the ability to load kernel modules" than it is "a tracing framework".
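
As a rough illustration of the packet filter case mentioned above, here is a Python model of the kind of decision an XDP program makes: look at the raw frame, bounds-check before every read, and return a verdict. A real program would be C compiled for the BPF target. The `filter_packet` and `make_frame` functions and the drop-UDP policy are invented for this sketch, while the verdict values match `XDP_DROP` and `XDP_PASS` in the kernel's `linux/bpf.h`.

```python
import struct

XDP_DROP = 1
XDP_PASS = 2

ETH_HLEN = 14       # Ethernet header length in bytes
ETH_P_IP = 0x0800   # EtherType for IPv4
IPPROTO_UDP = 17


def filter_packet(frame):
    """Drop IPv4/UDP frames, pass everything else (hypothetical policy)."""
    # Bounds check first; the BPF verifier insists on this before any read.
    if len(frame) < ETH_HLEN + 20:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    protocol = frame[ETH_HLEN + 9]  # protocol field of the IPv4 header
    return XDP_DROP if protocol == IPPROTO_UDP else XDP_PASS


def make_frame(protocol):
    """Build a minimal Ethernet + IPv4 frame carrying the given protocol number."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = bytearray(20)
    ip[0] = 0x45  # version 4, header length of 5 32-bit words
    ip[9] = protocol
    return eth + bytes(ip)
```

A UDP frame built with `make_frame(17)` gets the drop verdict, while a TCP frame (`make_frame(6)`) or a runt frame too short to parse is passed through.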

ChrisMarshallNY 5 years ago |

That sounds extremely cool.

Sadly, I don't program in Linux, so I can't use it. :'(

0b01 5 years ago | |

If you program on Windows you should check out Event Tracing for Windows (ETW). Similar to eBPF, ETW is a logging framework inside the Windows kernel. Microsoft.Diagnostics.Tracing.TraceEvent[0] is a nice NuGet package for logging and analyzing ETL files.

[0]https://github.com/microsoft/perfview/blob/master/documentat...

laserbeam 5 years ago | | |

But only after reading this glorious and funny article about using ETW for logging thread context switches. https://caseymuratori.com/blog_0025

tooltower 5 years ago | |

If you just want to learn and try it, you can always do it in a Linux VM.

My general development skill (in Linux or otherwise) has definitely improved since I became a Linux native. But that didn't happen overnight.

0mp 5 years ago | |

You may try out generic eBPF outside of Linux: https://github.com/generic-ebpf/generic-ebpf

0b01 5 years ago | | |

LLVM also has a BPF backend so you can compile any C++/C program to run on BPF.

jks 5 years ago | |

If you're using an old enough Mac OS X (I think 10.12 or older), DTrace has similar functionality. Unfortunately it has been broken in recent macOS versions, at least unless you disable SIP.