No More Blue Fridays

481 points by moreati 1 year ago | 270 comments

mrpippy 1 year ago |

> Once Microsoft's eBPF support for Windows becomes production-ready, Windows security software can be ported to eBPF as well.

This doesn’t seem grounded in reality. If you follow the link to the “hooks” that Windows eBPF makes available [1], it’s just for incoming packets and socket operations. IOW, MS is expecting you to use the Berkeley Packet Filter for packet filtering. Not for filtering I/O, or object creation/use, or any of the other million places a driver like Crowdstrike’s hooks into the NT kernel.

In addition, they need to be in the kernel in order to monitor all the other 3rd party garbage running in kernel-space. ELAM (early-launch anti-malware) loads anti-malware drivers first so they can monitor everything that other drivers do. I highly doubt this is available to eBPF.

If Microsoft intends eBPF to be used to replace kernel-space anti-malware drivers, they have a long, long way to go.

[1]: https://microsoft.github.io/ebpf-for-windows/ebpf__structs_8...

brendangregg 1 year ago | |

Yes, we know eBPF must attach to equivalent events to Linux, but given there are already many event sources and consumers in Windows, the work is to make eBPF another consumer -- not to invent instrumentation frameworks from scratch.

Just to use an analogy: Imagine people do their banking on JavaScript websites with Google Chrome, but if they use Microsoft Edge it says "JavaScript isn't supported, please download and run this .EXE". I'm not sure we'd be asking "if" Microsoft would support JavaScript (or eBPF), but "when."

surajrmal 1 year ago | | |

This assumes eBPF becomes the standard. It's not clear Microsoft wants that. They could create something else which integrates with dot net and push for that instead.

Also this problem of too much software running in the kernel in an unbounded manner has long existed. Why should Microsoft suddenly invest in solving it on Windows?

doctorpangloss 1 year ago | | |

Windows development on eBPF is slower than Linux development on eBPF, so it will never be supported. A source code user licensee could develop it faster, but who licenses Windows source and already has great eBPF experience?

nullindividual 1 year ago | |

Microsoft already has an extensible file system filter capability in place, which is what current AV uses. Does it make sense to add eBPF on top of that and if so, are there any performance downsides, like we see with file system filters?

mauvehaus 1 year ago | | |

They've done a technology transition once already from legacy file system filter drivers to the minifilter model. If they see enough benefit to another change, it wouldn't be unprecedented.

Mind you, it looks like after 20-ish years Windows still supports loading legacy filter drivers. Given the considerable work that goes into getting even a simple filesystem minifilter driver working reliably, it's safe to assume that we'd be looking at a similarly protracted transition period.

As to the performance, I don't think the raw infrastructure to support minifilters is the major performance hit. The work the drivers themselves end up doing tends to be the bigger hit in my experience.

Some background for the curious:

https://www.osr.com/nt-insider/2019-issue1/the-state-of-wind...

shahahqq 1 year ago | |

I hope though that Microsoft will double down on their eBPF support for Windows after this incident.

benfortuna 1 year ago | | |

Keep in mind they don't just allow any old code to execute in the kernel.

They do have rigorous tests (WHQL), it's just Crowdstrike decided that was too burdensome for their frequent updates, and decided to inject code from config files (thus bypassing the control).

The fault here is entirely with Crowdstrike.

stackskipton 1 year ago | | |

Doubt it. Microsoft is clearly over Windows. They continue to produce it but every release feels like "Ugh, fine, since you are paying me a ton of money."

Internally, Microsoft is running more and more workloads on Linux, and externally, I've had the .NET team tell me more than once that Linux is the preferred environment for .NET. The SQL Server team continues to push hard for Linux compatibility with every release.

EDIT: Windows Desktop gets more love because they clearly see that as an important market. I'm talking more about Windows Server.

kevin_nisbet 1 year ago |

I hate to dispute with someone like Brendan Gregg, but I'm hoping vendors in this space take a more holistic approach to investigating the complete failure chain. I personally tend to get cautious when there is a proposal that x will solve the problem that occurred on y date, especially 3 days after the failure. It may be true, but if we don't do the analysis we could leave ourselves open to blindspots. There may also be plenty of alternative approaches that should be considered and appropriately discarded.

I think the part I specifically dispute is the only negative outcome is wasted CPU cycles. That's likely the case for the class of bug, but there are plenty of failure modes where a bad ruleset could badly brick a system and make it hard to recover.

That's not to say eBPF-based security modules aren't the right choice for many vendors, just that we should understand what risks they do and do not avoid, and what part of the failure chain they particularly address.

mirashii 1 year ago | |

Just because you have not been aware of the discussions on this topic that have been happening for years, doesn't mean that they haven't been happening. This isn't some new analysis formed 3 days after an incident, this is the generally accepted consensus among many experts who have been working in the space, introducing these new APIs specifically to improve stability, security, etc. of systems.

ohmyiv 1 year ago | |

> I personally tend to get cautious when there is a proposal that x will solve the problem that occurred on y date, especially 3 days after the failure.

Microsoft has been working on eBPF for a few years at least.

https://opensource.microsoft.com/blog/2021/05/10/making-ebpf...

https://lwn.net/Articles/857215/

If you're really concerned, they have discussions and communication channels where you're invited to air your concerns. They're listed on their github:

https://github.com/microsoft/ebpf-for-windows

Who knows, maybe they already have answers to your concerns. If not, they can address them there.

kayo_20211030 1 year ago |

This isn't right. If I need a system to run with a piece of code, then it shouldn't run at all if that piece of code is broken. Ignoring the failure is perverse. Let's say that the driver code ensures that some medical machine has safety locks (safeguards) in place to make sure that piece of equipment won't fry you to a crisp; I'd prefer that the whole thing not run at all rather than blithely operate with the safeguards disabled. It's turtles all the way down.

amluto 1 year ago |

> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

eBPF is fantastic, and it can be used for many purposes and improve a lot of things, but this is IMO overselling it. Assuming that BPF itself is free of bugs, it’s still a rather large sprawl of kernel hooks, and those hooks invoke eBPF code, which can call right back into the kernel. Here’s a list:

https://www.man7.org/linux/man-pages/man7/bpf-helpers.7.html

bpf_probe_read_kernel() is particularly heavily used, and it is not safe. It tries fairly hard not to OOPS or crash, but it is definitely not perfect.

The rest of that list contains plenty of things that will easily take down a system, even if it doesn’t actually oops or panic in the process.

And, of course, any tool that detects userspace “malicious behavior” and stops it can start calling everything malicious, and the computer becomes unusable.

Meanwhile, eBPF has no real security model on the userspace side. Actual attachment of an eBPF program goes through the bpf() syscall, not through sensibly permissioned operations on the underlying kernel objects being attached to, and there is nothing whatsoever that confines eBPF to, say, a container that uses it. (See bpf_probe_read_kernel() -- it's fundamentally able to read all kernel memory.)

So, IMO, most of the benefit of eBPF over ordinary kernel C code is that eBPF is kind of like writing code in a safe language with a limited unsafe API surface. It's a huge improvement for this sort of work, but it is not perfect by any means.

> The verifier is rigorous -- the Linux implementation has over 20,000 lines of code

The verifier is absurdly complex. I'd rather see something based on formal methods than 20kLOC of hand-written logic.

umanwizard 1 year ago | |

How is it possible to panic using bpf_probe_read_kernel ? Can you give an example that works on the current kernel version?

amluto 1 year ago | | |

I'm not sure that "panic" is the right word here. bpf_probe_read_kernel boils down to copy_from_kernel_nofault, which checks for an "allowed" address and then does the access. Any page faults turn into error returns instead of OOPSes. x86 disallows user addresses, the vsyscall page, and non canonical addresses.

Doing this from bpf assumes that all "allowed" addresses are side-effect-free and will either succeed or cleanly fault. Off the top of my head, MMIO space (including oddities like the APIC page on CPUs that still have that) and TDX memory are not in this category.
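To make the shape of that check concrete, here is a toy Python model, not kernel code. It assumes x86-64 with 4-level paging (48-bit virtual addresses); the function names are made up for illustration, and only the three rejections mentioned above are modeled. What it cannot express is exactly the gap described: an address that passes every test may still be MMIO or otherwise have side effects when read.

```python
# Toy model of an "is this kernel read allowed?" check on x86-64.
# Assumption: 4-level paging (48-bit virtual addresses). Names are hypothetical.

VADDR_BITS = 48
PAGE_SIZE = 4096
VSYSCALL_ADDR = 0xFFFFFFFFFF600000  # legacy vsyscall page on x86-64


def is_canonical(addr: int) -> bool:
    """Bits 63..47 must all be equal (sign extension of bit 47)."""
    top = addr >> (VADDR_BITS - 1)  # the upper 17 bits
    return top == 0 or top == (1 << (64 - VADDR_BITS + 1)) - 1


def is_user_address(addr: int) -> bool:
    """In this model the lower canonical half belongs to user space."""
    return addr < (1 << (VADDR_BITS - 1))


def read_allowed(addr: int) -> bool:
    """Reject the address classes named above; accept everything else."""
    if not 0 <= addr < (1 << 64):
        return False
    if not is_canonical(addr):
        return False
    if is_user_address(addr):
        return False
    if VSYSCALL_ADDR <= addr < VSYSCALL_ADDR + PAGE_SIZE:
        return False
    # Nothing here knows whether addr is ordinary RAM or a device register.
    return True
```

An address in the upper canonical half passes this filter regardless of what is mapped there, which is the assumption being questioned.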

uticus 1 year ago |

> eBPF programs cannot crash the entire system because they are safety-checked by a software verifier and are effectively run in a sandbox.

Isn’t one of the purposes of an OS to police software? I get that this has to do with the OS itself, but what does watching the watchers accomplish other than adding a layer which must then be watched?

Why not reduce complexity instead of naively trusting that the new complexity will be better long term?

riskable 1 year ago | |

eBPF isn't "watching the watchers" it's just a tool that lets other tools access low-level things in the kernel via a very picky sandbox. Think of it like this:

Old way: Load kernel driver, hook into bazillions of system calls (doing whatever it is you want to do), pray you don't screw anything up (otherwise you can get a panic though not necessarily--Linux is quite robust).

eBPF way: Just ask eBPF to tell you what you want by giving it some eBPF-specific instructions.

There's a rundown on how it works here: https://ebpf.io/what-is-ebpf/

uticus 1 year ago | | |

> eBPF isn't "watching the watchers"…

> …via a very picky sandbox…

When the eBPF is a CrowdStrike mechanism, and eBPF is “picky,” it is clearly “watching the watchers.”

MetaWhirledPeas 1 year ago | |

Right? I might spend a few minutes seeing if an AI chatbot can explain all the justifications that lead to using something like CrowdStrike in the first place.

brundolf 1 year ago |

This sounds like a cool technology, but this was the really egregious problem:

> There are other ways to reduce risks during software deployment that can be employed as well: canary testing, staged rollouts, and "resilience engineering" in general

You don't need a new technology to implement basic industry-standard quality control

__MatrixMan__ 1 year ago |

Maybe we should start taking Fridays off to commemorate the event, which probably would have been less bad if more people spent less time with their nose to the grindstone and had more time to stop and think about how it all was shaping up and how they could influence that shape.

muth02446 1 year ago |

> The verifier is rigorous -- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage.

Wow, 20k is not exactly encouraging. Besides the extra attack surface, who can vouch for such a large code base?

haberman 1 year ago | |

I had exactly the same thought. I don’t know if that 20k number was supposed to inspire confidence, but for me it did the opposite. It would have inspired confidence if it was 300 lines of code.

My impression is that the WebAssembly verifier is much simpler.

the8472 1 year ago |

If the filters are loaded at boot and hook into everything then a bug can still lock down the system to a point where it can't be operated or patched anymore (e.g. because you loaded an empty whitelist). So it could end up replacing a boot loop with another form of DoS.

If Microsoft includes a hardcoded whitelist that covers some essentials needed for recovery, that could make a bug in such a tool easier to fix, but it could still cause effective downtime (system running but unusable) until such a fix is delivered.
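A minimal sketch of that idea in Python, purely as a model: the process names and the `RECOVERY_ESSENTIALS` set are invented for illustration and do not correspond to any real Windows or eBPF interface.

```python
# Hypothetical model: a vendor-supplied allowlist backed by a hardcoded
# recovery set that a bad update cannot remove. All names are made up.

RECOVERY_ESSENTIALS = frozenset({"updater", "shell", "network"})


def is_allowed(process: str, vendor_allowlist: set, use_recovery_set: bool = True) -> bool:
    """Allow a process if the vendor list or the recovery set names it."""
    if process in vendor_allowlist:
        return True
    return use_recovery_set and process in RECOVERY_ESSENTIALS


# A buggy update ships an empty allowlist:
broken_update = set()
# Without the recovery set, nothing can run, so the fix can never be installed.
# With it, the updater still works, but ordinary applications stay blocked.
```

This is the "running but unusable" case: recovery remains possible, yet everything outside the essentials is denied until the corrected list arrives.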

throwaway2037 1 year ago |

The blog post says:

    > eBPF, which is immune to such crashes.

I tried to Google about this, but I cannot find anything definitive. It looks like you can still break things. Can an expert on eBPF please comment on this claim? This is the best that I could find: https://stackoverflow.com/questions/70403212/why-is-ebpf-sai...

umanwizard 1 year ago | |

eBPF programs cannot crash the kernel, assuming there are no bugs in the eBPF verifier. There have been such bugs in the past but they seem to be getting more and more rare.

javierhonduco 1 year ago | | |

Or in other parts of the kernel. It's been the case on multiple occasions that buggy locking (or, more generally, a missing 'resource' release) has caused problems for perfectly safe BPF programs. For example, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398 and the fix https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

rwmj 1 year ago | | |

This isn't really true. eBPF programs in Linux have access to a large set of helper functions written in plain C. https://lwn.net/Articles/856005/

queuebert 1 year ago | | |

I would be very hesitant to say "cannot" in a million-line C code base.

kaliszad 1 year ago |

"These security agents will then be safe and unable to cause a Windows kernel crash."

Unless, of course, there is a bug in eBPF (https://access.redhat.com/solutions/7068083) @brendangregg and the kernel panics/BSoDs anyway, which you do mention later in the article.

acdha 1 year ago | |

This is true but the kernel gets more scrutiny and has better priorities. Only CrowdStrike audits and hardens the CS kernel driver, so things like proactive improvements are competing in a single Jira board against marketing’s request for new features (want to bet that was all AI until Friday?) whereas the kernel eBPF implementation might be improved by people at other security vendors, distributions like Red Hat or Ubuntu or a major cloud provider (all of whom fund serious security audits and have engineers who care a lot about robustness), or academic researchers.

“Many eyes” is a bit dubious in general but the Linux kernel is pretty much the best case for it being true.

ec109685 1 year ago | |

The benefit of fixing that bug is that all eBPF programs benefit, versus every security vendor needing to ensure they write perfect C code.

xg15 1 year ago |

> In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

Assuming every security critical system will be on a recent enough kernel to support this...

efee22 1 year ago | |

I think with a LTS distribution you should get very far these days when it comes to implementing such sensors.

chasil 1 year ago | | |

On rhel8 variants, you can use the Oracle UEK to get eBPF.

https://blogs.oracle.com/linux/post/oracle-linux-and-bpf

  $ cat /etc/redhat-release /etc/oracle-release /proc/version
  Red Hat Enterprise Linux release 8.10 (Ootpa)
  Oracle Linux Server release 8.10
  Linux version 5.15.0-203.146.5.1.el8uek.x86_64 (mockbuild@host-100-100-224-48) (gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9.2.0.1), GNU ld version 2.36.1-4.0.1.el8_6) #2 SMP Thu Feb 8 17:14:39 PST 2024

dredmorbius 1 year ago | |

Considering the number of systems running very obsolete OSes these days: WinNT (4x or 3x), Windows, DOS, or various proprietary Unixen, stale Linux flavours, etc., etc., ... yes, quite.

dijit 1 year ago | |

And assuming there's no bugs in the BPF code...

Oh wait: https://news.ycombinator.com/item?id=41031699

efee22 1 year ago | | |

RHEL kernel.. right. Imho, I'd trust an upstream stable kernel far more than a RHEL one for production, which has dozens of feature backports and an internal kABI to maintain.. granted RH has a QA team, but it is still impossible to test everything beforehand.

blinkingled 1 year ago |

Ok. But the good old push code to staging / canary it before mainstream updates was a simpler way of solving the same problem.

Crowdstrike knows the computers they're running on; it is trivial to implement a system where only a few designated computers download and install the update and report metrics before the update controller decides to push it to the next set.
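Roughly what such an update controller could look like, as a Python sketch. Everything here is hypothetical: the `install` callback stands in for "push the update and wait for the host to report healthy", and the wave sizes and failure threshold are arbitrary choices, not anything Crowdstrike is known to use.

```python
# Sketch of a staged (canary) rollout controller. All names are hypothetical.
from __future__ import annotations

from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    install: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.1, 0.5, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[bool, list[str]]:
    """Push an update in growing waves and stop when a wave looks unhealthy.

    install(host) returns True if the host reports healthy after updating.
    Returns (completed, hosts_that_received_the_update).
    """
    updated: list[str] = []
    done = 0
    for fraction in stage_fractions:
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[done:target]
        if not wave:
            continue
        failures = sum(0 if install(host) else 1 for host in wave)
        updated.extend(wave)
        done = target
        if failures / len(wave) > max_failure_rate:
            return False, updated  # halt: later waves never get the update
    return True, updated
```

With a hundred hosts and an update that breaks every machine, only the first wave (a single host here) is affected before the controller halts.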

Archelaos 1 year ago | |

It would mitigate the problem, but not solve it. You can still imagine a condition that only occurs after the update has been rolled out everywhere. Furthermore, such a bug would still be extremely problematic for the concerned customers, even if not all of them were affected. In addition, it would be necessary to react very quickly in the case of zero-day vulnerabilities.

blinkingled 1 year ago | | |

Yes, I am not arguing against having the ability to deal with it quickly - I am saying canary/staging helps you do exactly that. Because, as we see in the case of Intel CPUs and Crowdstrike, some problems, or the scale of some problems, are best prevented.

tantalor 1 year ago | | |

(semantic argument warning)

"Mitigation" is dealing with an outage/breakage after it occurs, to reduce the impact or get system healthy again.

You're talking about "prevention" which keeps it from happening at all.

Canarying is a generic approach to prevention, and should not be skipped.

Avoiding the risk entirely (eBPF) would also help prevent outages, but I think we're deluding ourselves to say it "solves" the problem once and for all; systems will still go down due to bad deploys.

rldjbpin 1 year ago | |

with the way they handled the debian crashing a little while ago, frankly they are happy to still go ahead with testing this way. still much better way to handle things than pushing to everybody at the same time.

phartenfeller 1 year ago | |

Why trust somebody else not messing up? With that in place for windows and crowdstrike billions of dollars would be saved and many lives not negatively impacted ...

skywhopper 1 year ago |

The implicit assumption of the article is that eBPF code can't crash a kernel, but the article itself eventually admits that it can and has done, including last month. eBPF is a safer way of providing kernel-extension functionality, for sure, but presenting it as the perfect solution is just asking to have your argument dismissed. eBPF is not perfect. And there's plenty of things it can't do. The very sandbox rules that limit how long its programs may run and what they can do also make it entirely inappropriate for certain tasks. Let's please stop pretending there's a silver bullet.

efee22 1 year ago | |

It's not a silver bullet; however, it is still better to push all the panicable bugs into one community-maintained section (e.g. the eBPF verifier). All vendors have an incentive to help get it right, and this is much better than every vendor shipping their own panicable bugs in their own out-of-tree kernel modules. Additionally, it's not just the industry looking at eBPF, but also academia in terms of formally verifying these critical sections.

lazycog512 1 year ago |

"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair."

- Douglas Adams

nkozyra 1 year ago |

I don't do any kernel stuff so I'm out of my element, but doesn't the fact that Crowdstrike & Linux kernel eBPF already caused kernel crashes[1] sort of downplay the rosiness of the state of things?

[1]: https://access.redhat.com/solutions/7068083

guipsp 1 year ago | |

This is specifically addressed in the post you are replying to

nkozyra 1 year ago | | |

Can you elaborate? What I see about Linux is that Crowdstrike was in the process of adopting eBPF which is ostensibly immune to kernel panics, but that issue shows their eBPF implementation specifically causing a kernel panic.

kjellsbells 1 year ago |

Lets suppose that eBPF solves this particular problem, eventually, for Windows. Doesn't sidestepping the entire class of Crowdstrike-style fubars require that Microsoft then mandate that no, backward compatibility will not be offered?

Back compat seems to be such a shibboleth in the Windows world, but comes at an incredible price. The reasons cited all seem to boil down to keeping some imagined customers' obscure LOB app running for decades. But that seems like an excuse to me. Surely Microsoft would like to shake out the last diehards running some VB5 app on a patched up PC in a factory. Isn't it more beneficial to everyone to start sunsetting acres of ancient NT code and approaches and streamline the entire attack surface?

acdha 1 year ago | |

Backwards compatibility slows things down in the Windows world but it doesn’t halt improvements. In this case, there are two powerful ratchets:

1. Compliance: everyone affected by this bug has auditors. Once safer alternatives are available, the standards like CIS, PCI, etc. will be updated to say you should use the new interface, and every enterprise IT department will have pressure to switch to eBPF tools. We saw this with BitLocker: storage encryption used to be a pain, people resisted it, but over time it became universal because the cost of swimming upstream was too high.

2. Signing. Microsoft can start requiring more proof of need and restrictions for signing drivers. They have to be careful to avoid the appearance of favoritism but after this debacle that’s a LOT easier. I would bet some engineer is working on a draft of mandatory fault handling and testing proof requirements for critical kernel drivers now and I would not be surprised to see it include a timeframe for adopting memory-safe languages.

another2another 1 year ago | |

> Surely Microsoft would like to shake out the last diehards running some VB5 app on a patched up PC in a factory.

> Isn't it more beneficial to everyone to start sunsetting acres of ancient NT code and approaches and streamline the entire attack surface?

If your code somehow still relies on some buggy behaviour to work, then MS shouldn't do anything to preserve that anymore - apparently they used to, but I'm not so sure nowadays.

However 'ancient NT' code should probably still function just fine since the Win32 API hasn't changed much for a while, and MS don't actively deprecate function calls (unlike Apple who seem to do it a bit on a whim recently). I would put this down to the API being pretty well designed in the first place.

pas 1 year ago | |

it would be enough if MS offered knobs and switches for admins/devs/vendors to disallow non-static-verified stuff in the kernel

xyzzy123 1 year ago |

So many problems though! Including commercial monocultures, lack of update consent, blast radius issues, etc etc. There's a commons in our pockets but that is very difficult to regulate for. They will keep putting the gun to your head until you keep choosing the monoculture.

shahahqq 1 year ago | |

worrisome indeed that now the world knows how many users are affected by crowdstrike so the bad guys just need to poke deeper there

titzer 1 year ago |

WebAssembly is a better choice for sandboxing kernel code. It has a full formal specification with a mechanized proof of type safety, many high-performance implementations, broad toolchain support, is targetable from many languages, and a capability security model.

rapidlua 1 year ago | |

Hardly. For starters, wasm doesn’t guarantee that a piece of code terminates in bounded time. There are further security guarantees in eBPF, such as that any lock acquired must be released.

jules 1 year ago | | |

The eBPF termination checker is buggy anyway; you cannot rely on it.

titzer 1 year ago | | |

You can apply additional static checks to Wasm, e.g. control flow analysis, and reject programs without obvious loop bounds or unbalanced locking operations. Or you could apply dynamic techniques like tracking acquired locks and automatically releasing them, or charging fuel (gas). The latter is quite common for blockchain runtimes based on Wasm.
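For anyone unfamiliar with fuel metering, here is a toy Python illustration of the dynamic approach. It is a made-up three-instruction stack machine, not WebAssembly and not any real runtime's API; the point is only that the host charges a unit per instruction and aborts when the budget is gone, so even an infinite loop terminates.

```python
# Toy illustration of fuel (gas) metering on a made-up stack machine.
# This is not WebAssembly; instruction names and costs are hypothetical.


class OutOfFuel(Exception):
    """Raised when a program exhausts its instruction budget."""


def run(program, fuel):
    """Run a tiny stack program, charging one unit of fuel per instruction.

    Instructions: ("push", n), ("add",), ("jmp", target_index).
    Returns (stack, fuel_left); raises OutOfFuel if the budget runs out.
    """
    stack, pc = [], 0
    while pc < len(program):
        if fuel == 0:
            raise OutOfFuel(f"stopped at pc={pc}")
        fuel -= 1
        op = program[pc]
        if op[0] == "push":
            stack.append(op[1])
        elif op[0] == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op[0] == "jmp":
            pc = op[1]
            continue  # jump taken: skip the normal pc increment
        else:
            raise ValueError(f"unknown instruction {op!r}")
        pc += 1
    return stack, fuel
```

A well-behaved program finishes with fuel to spare, while `[("jmp", 0)]` is cut off after exactly as many steps as it was given.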

3np 1 year ago |

> The worst thing an eBPF program can do is to merely consume more resources than is desirable, such as CPU cycles and memory.

This is obviously not true. It might be the worst it can do, by itself, to the currently running kernel. It's not the worst it can do to the machine or its user(s).

There are infinite harmful things an eBPF program can do. As can programs solely in user-space. There is a specific class of vulnerabilities being mitigated by moving code from kernel to BPF. That does not mean that eBPF programs are in general safe.

usrme 1 year ago |

Does anyone know how far along the eBPF implementation for Windows actually is? In the sense that it could start feasibly replacing existing kernel drivers.

tgtweak 1 year ago |

Even if Microsoft rolls out eBPF and mainstreams it - it will be years before everything is ported over and it still won't address legacy windows versions (which appear to be a good chunk of what was impacted).

It's a move in the right direction but it probably won't fully mitigate issues like this for another 5+ years.

acdha 1 year ago | |

Sure, but 5 years is not that long - for example, if they’d started right before the pandemic it’d be almost done by now. The best time to have done that was 5 years ago but the second best time is now.

CodeWriter23 1 year ago |

> an unprecedented example of the inherent dangers of kernel programming

I take issue with that. Kernel programming was not to blame; looking up addresses from a file and accessing those memory locations without any validation is. The same technique would yield the same result at any Ring.
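A generic illustration of the missing step, in Python. This is not a reconstruction of Crowdstrike's file format or code; the 4-byte little-endian index and the names are assumptions chosen only to show "validate what you read from a file before using it to look something up".

```python
# Generic sketch: validate a file-supplied index before using it.
# The blob layout (one little-endian u32) is an assumption for illustration.
import struct


def read_entry(table, config_blob):
    """Return table[index], where index is decoded from config_blob."""
    if len(config_blob) < 4:
        raise ValueError("config blob too short to contain an index")
    (index,) = struct.unpack_from("<I", config_blob, 0)
    if index >= len(table):
        raise ValueError(f"index {index} out of range for table of {len(table)}")
    return table[index]
```

In Python a bad index would raise on its own; the explicit checks stand in for what has to be written by hand in C, where the unchecked access is an out-of-bounds read.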

lucianbr 1 year ago | |

Obviously in userspace it would only crash the running program and not the entire operating system? It's a significant difference.

All of the service interruptions would have been just "computer temporarily not protected by crowdstrike agent". Not the same thing at all.

chrisjj 1 year ago | | |

> Obviously in userspace it would only crash the running program and not the entire operating system? It's a significant difference.

Significant and often far worse. It would leave the machine running unprotected.

CodeWriter23 1 year ago | | |

> It's a significant difference.

When various apps running the world are crashing, unable to execute because malware protection is failing, there is no difference.

nine_k 1 year ago | |

At Ring 3 it would crash an app, not the entire OS.

Yes, the kernel is fine and is not to blame. But running basically a rootkit controlled by a third party indeed is to blame.

CodeWriter23 1 year ago | | |

> At Ring 3 it would crash an app, not the entire OS.

That's still an outage for those key systems.

dwattttt 1 year ago | |

FWIW their configuration files can't be holding addresses; those have been randomised in the kernel for at least a decade

twen_ty 1 year ago |

Can someone tell me what the advantage of eBPF over a user-mode driver is? The article makes it look like eBPF is a have-your-cake-and-eat-it-too solution, which is too good to be true. Can you run graphics drivers in eBPF, for example?

Yawrehto 1 year ago |

1. How does eBPF solve this? It makes it more difficult, sure, but it'll almost always be possible to cause a crash, if you try hard enough.

2. More importantly, the problem is rarely fixable by changing technology, because typically, problems are caused by people and their connections: social/corporate pressures, profit-seeking, mental health being treated as unimportant, et cetera. eBPF can't fix those, and as long as corporations have social structures that penalize thoroughness and caution, and incentivize getting 'the most stuff' done, this will persist as a problem.

umanwizard 1 year ago | |

> it'll almost always be possible to cause a crash, if you try hard enough.

If you think you know a way to crash the Linux kernel by loading and running an eBPF program, you should report a bug.

tracker1 1 year ago |

I don't buy it... didn't a bug from RedHat + Crowdstrike have a similar panic issue? I understand in that case it was because of RedHat, but still. I don't think this, by itself will change much.

WaitWaitWha 1 year ago |

eBPF == extended Berkeley Packet Filter

https://en.wikipedia.org/wiki/Berkeley_Packet_Filter

kayge 1 year ago | |

Thanks! This was not a familiar acronym to me... and after some digging[0] apparently it's no longer an acronym:

"BPF originally stood for Berkeley Packet Filter, but now that eBPF (extended BPF) can do so much more than packet filtering, the acronym no longer makes sense. eBPF is now considered a standalone term that doesn’t stand for anything."

[0] https://ebpf.io/what-is-ebpf/

dveeden2 1 year ago |

So eBPF is giving us eBFP (enhanced Blue Friday Protection)?

mschuster91 1 year ago |

> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement. It's possible for Linux today, and Windows soon. While some vendors have already proactively adopted eBPF (thank you), others might need a little encouragement from their paying customers.

How about Microsoft's large government and commercial customers make it a requirement that MS does not develop a single new feature for the next two fucking years or however long it takes to go through the entirety of the Windows+Office+Exchange code base and to make sure there are no security issues in there?

We don't need ads in the start menu, we don't need telemetry, we don't need desktop Outlook becoming a rotten slow and useless web app, we don't need AI, we certainly don't need Recall. We need an OS environment that doesn't need a Patch Tuesday where we have to check if the update doesn't break half the canary machines.

And while MS is at that they can also take the goddamn time and rework the entire configuration stack. I swear to god, it drives me nuts. There's stuff that's only accessible via the registry (and there is no comprehensive documentation showing exactly what any key in the registry can do - large parts of that are MS-internal!), there's stuff only accessible via GPO, there's stuff hidden in CPLs dating back to Windows 3.11, and there's stuff in Windows' newest UI/settings framework.

jeffrallen 1 year ago |

Here's an idea for an interesting hack: a piece of kernel-resident code that feeds fake data into eBPF so that an eBPF-based antimalware will see nothing bad as the malware goes about its merry way.

Sandboxes are safe, but are ultimately virtual machines, and virtual machines can be made to live in a world that's not real.

yubiox 1 year ago |

Title reminds me of when microsoft promised no more UAEs back in 92. They just renamed them to GPFs in windows 3.1.

egorfine 1 year ago |

One option to prevent this is to not run corporate spyware. But I guess for some industries this isn't an option.

supriyo-biswas 1 year ago | |

I don’t understand statements like this. You only need to have some employee install some malware (unintentionally or otherwise); and you have a data breach on your hands.

egorfine 1 year ago | | |

I agree it's much more scalable to have a vendor install a spyware on all your workstations and have a centralized data breach.

datadeft 1 year ago |

It is great that we need a linux kernel feature to be ported to Windows so we don’t have blue Fridays

CoastalCoder 1 year ago |

> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.

Are they saying that device drivers should be written in eBPF?

Or maybe their drivers should expose an eBPF API?

I assume some driver code still needs to reside in the actual kernel.

prmoustache 1 year ago | |

These tools wouldn't need kernel drivers, only to target the eBPF userspace API: https://www.kernel.org/doc/html/latest/userspace-api/ebpf/in...

wiresurfer 1 year ago |

Hey Brendan,

> If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement.

"Windows soon" may still be at least a year away. Would that be a fair statement? "At least" being the operative phrase here.

Specifically in the context of network security software, for eBPF programs to be portable across Windows/Linux, we would need MSFT to add a lot more hooks and expose internal kernel structs. Hopefully via a common libbpf definition. Otherwise, I fear, having two versions of the same product across two OSs would mean more security and quality issues.

I guess the point I am trying to make is, we would get there, but we are more than a few years away. I would love to see something like cilium on vanilla Windows for a software-defined, company-wide network. We can then start building enterprise network security into it. Baby steps!

---

btw, your talks and blog posts about bpftools are a godsend!

vfclists 1 year ago |

Yep, another fix to all our problems, a new bandwagon to be jumped on by all EDR vendors, until ...

Here I am using the term "EDR". Until this CrowdStrike debacle I'd never heard it.

Only tells how seriously you should take my opinions.

throw0101d 1 year ago |

Meta:

> eBPF (no longer an acronym) […]

Any reason why the official acronym was done away with?

riskable 1 year ago | |

Because it used to stand for extended Berkeley Packet Filter and it has since moved far, far beyond just packets. It now hooks into the entire network stack, security, and does observability/tracing for nearly anything and everything in the kernel ("nearly" because some stuff runs when the kernel boots up--before eBPF is loaded--and never again after that).

sandywaffles 1 year ago | |

Because eBPF is no longer just packet filtering? It's now used in loads of hook points unrelated to packets or filtering at all.

Jedd 1 year ago | |

Technically it was never an acronym - rather an initialism or abbreviation.

ninju 1 year ago |

So a couple of questions

1) Is CrowdStrike Falcon using eBPF for their Linux offering?

2) Would the faulty patch update get caught by the eBPF verifier?

rezonant 1 year ago |

> the company behind this outage was already in the process of adopting eBPF, which is immune to such crashes

Oh I'm sure they'll find a way.

fullspectrumdev 1 year ago |

This puts an awful lot of stock in the robustness of eBPF.

Which is odd, given there’s been a bunch of kernel privesc bugs using eBPF…

0xbadcafebee 1 year ago |

> In the future, computers will not crash due to bad software updates

I'm still waiting on my flying car...

ksec 1 year ago |

The article mentions Windows and Linux. Does anyone know if there will be eBPF for FreeBSD?

Scene_Cast2 1 year ago |

How much extra security does this provide on top of HLK?

userbinator 1 year ago |

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code.

100% BS. Even if they don't "crash" they will "stop functioning as intended" which is just the same. It's absolutely disgusting how this industry is now using this one outage as a talking point to further their totalitarian agenda.

It reminds me of how Google went after adblockers with their new extension model that also promised more "security". It's time we realised what they're really trying to do. In fact, I wonder whether this outage was not accidental after all.

klooney 1 year ago |

First io_uring, now eBPF. Kind of wild.

asynchronous 1 year ago |

Is there a reason for the lack of naming+shaming Crowdstrike in this blogpost? Was it to not give them any more publicity, good or bad?

StevenWaterman 1 year ago | |

If you consider kernel programming to be inherently unsafe, then you would consider this to be inevitable, meaning it's not really the specific company's fault. They were just the unlucky ones.

brendangregg 1 year ago | | |

Right, and we wanted to talk about all security solutions and not make this about one company. We also wanted to avoid shaming since they have been seriously working on eBPF adoption, so in that regard they are at the forefront of doing the right thing.

lordnacho 1 year ago | | |

They could have helped their luck by doing some of the common sense things suggested in the article.

For instance, why not find a subset of your customers that are low risk, push it out to them, and see what happens? Or perhaps have your own fleet of example installations to run things on first. None of which depends on any specific technology.
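
A staged rollout like that could be sketched roughly as follows. This is only an illustration, not any vendor's actual process: the `rollout_ring` and `should_receive` helpers, the ring names, and the percentage thresholds are all hypothetical.

```python
import hashlib

# Hypothetical ring layout: cumulative percentage of the fleet that has the
# update once each ring is enabled.
RINGS = [("internal", 1), ("canary", 5), ("early", 25), ("broad", 100)]


def rollout_ring(host_id: str) -> str:
    """Deterministically map a host to a rollout ring via a stable hash."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    for name, upper in RINGS:
        if bucket < upper:
            return name
    raise AssertionError("unreachable: the last ring covers every bucket")


def should_receive(host_id: str, enabled_ring: str) -> bool:
    """True if the host's ring is at or before the ring currently enabled."""
    order = [name for name, _ in RINGS]
    return order.index(rollout_ring(host_id)) <= order.index(enabled_ring)
```

Hashing the host id keeps the assignment stable between updates, so the same small set of machines sees every update first and a bad one stops there instead of reaching the whole fleet.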

efee22 1 year ago | | |

Agree, Crowdstrike was an unlucky one, but it is more about the issue in general. If I remember correctly, others like sysdig also use their own kernel modules for collection.

asynchronous 1 year ago | | |

I still hold true that testing even improperly would have caught this before it hit worldwide. But I suppose you are right, that doesn’t help the argument being made here.

hiddencost 1 year ago | |

I think the article isn't about crowd strike. It's about ebpf.

pimlottc 1 year ago | | |

The second paragraph is 100% about Crowdstrike. It even links to the Wikipedia article:

https://en.m.wikipedia.org/wiki/2024_CrowdStrike_incident

7e 1 year ago |

eBPF will be an improvement, I’m sure, but does not mean the end of bugs/DoS in software.

odyssey7 1 year ago |

"The verifier is rigorous"

But the appeal-to-authority evidence that the article presents is not.

"-- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage."

ReleaseCandidat 1 year ago |

Sorry, but neither eBPF nor Rust nor formal verification nor ... is going to solve that problem. Repeat after me: there are no technical solutions to social problems. As long as the result of such an outage is basically a "oh, a software problem! shrug", _nothing_ will change.

bfrog 1 year ago |

I wonder if microkernels ever had this kind of bullshit. Had it been a microkernel, would we all be sitting twiddling our thumbs on friday? Hot take: No.

shrx 1 year ago |

From the article:

> If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code [0] -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington).

[0] links to https://github.com/torvalds/linux/blob/master/kernel/bpf/ver... which has this interesting comment at the top:

    /* bpf_check() is a static code analyzer that walks eBPF program
     * instruction by instruction and updates register/stack state.
     * All paths of conditional branches are analyzed until 'bpf_exit' insn.
     *
     * The first pass is depth-first-search to check that the program is a DAG.
     * It rejects the following programs:
     * - larger than BPF_MAXINSNS insns
     * - if loop is present (detected via back-edge)
    ...

I haven't inspected the code, but I thought that checking for infinite loops would imply solving the halting problem. Where's the catch?

lolinder 1 year ago | |

I'm not able to comment on what this code is doing, but as for the theory:

The halting problem is only unsolvable in the general case. You cannot prove that any arbitrary piece of code will stop, but you can prove that specific types of code will stop and reject anything that you're unable to prove. The trivial case is "no jumps"—if your code executes strictly linearly and is itself finite then you know it will terminate. More advanced cases can also be proven, like a loop over a very specific bound, as long as you can place constraints on how the code can be structured.

As an example, take a look at Dafny, which places a lot of restrictions on loops [0], only allowing the subset that it can effectively analyze.

[0] https://ece.uwaterloo.ca/~agurfink/stqam/rise4fun-Dafny/#h25

jkrejcha 1 year ago | | |

Adding on (and it's not terribly relevant to eBPF), it's also worth noting that there are trivial programs you can prove DON'T halt.

A trivial example[1]:

    int main() {
        while (true) {}
        int x = foo();
        return x;
    }

This program trivially runs forever[2], and indeed many static code analyzers will point out that everything after the `while (true) {}` line is unreachable.

I feel like the halting problem is incredibly widely misunderstood to be about "ANY program" when it really talks about "ALL programs".

[1]: In C++, this is undefined behavior technically, but C and most other programming languages define the behavior of this (or equivalent) function.

[2]: Fun relevant xkcd: https://xkcd.com/1266/
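
For a rough idea of how an analyzer can flag this, here is a small sketch using Python's standard `ast` module on Python source instead of C. The helper names `loops_forever` and `unreachable_lines` are made up for illustration, and the check is deliberately conservative: it only recognizes a literal `while True:` with no `break`, `return` or `raise` inside, and it ignores exceptions raised by called functions or by signals.

```python
import ast


def loops_forever(node) -> bool:
    """True for a literal `while True:` containing no break/return/raise."""
    if not isinstance(node, ast.While):
        return False
    test = node.test
    if not (isinstance(test, ast.Constant) and test.value is True):
        return False
    # Conservative: any exit statement anywhere inside means "not sure".
    exits = (ast.Break, ast.Return, ast.Raise)
    return not any(isinstance(n, exits) for n in ast.walk(node))


def unreachable_lines(source: str) -> list:
    """Line numbers of top-level function statements after an infinite loop."""
    found = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        for i, stmt in enumerate(func.body):
            if loops_forever(stmt):
                found.extend(s.lineno for s in func.body[i + 1:])
                break
    return found
```

Anything the helper cannot prove is simply left unflagged, which is the "some programs, not all programs" point in miniature.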

Retr0id 1 year ago | |

The halting problem cannot be solved in the general case, but in many cases you can prove that a program halts. eBPF only allows verifiably-halting programs to run.

dathinab 1 year ago | |

the halting problem is only true for _arbitrary_ programs

but there are always sets of programs for which it is clearly possible to guarantee their termination

e.g. the program `return 1+1;` is guaranteed to halt

e.g. a program like `while condition(&mut state) { ... }`, where `condition()` is guaranteed to halt but is otherwise unknown, is not guaranteed to halt, but if you turn it into `for _ in 0..1000 { if !condition(&mut state) { break; } ... }` then it is guaranteed to halt after at most 1000 iterations

or in other words eBPF only accepts programs which it can prove will halt in at most maxins "instructions" (though it's stricter than my example, i.e. you would need to unroll the for-loop to make it pass validation)

the thing with programs which are provably halting is that they tend to also not be very convenient to write and/or quite limited in what you can do with them, i.e. they are not suitable as general purpose programming languages at all
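
The bounded-loop rewrite can be shown in a few lines of Python. This is a sketch of the general idea only, not of how eBPF programs are written; `run_bounded` and its parameters are hypothetical names, and it assumes `condition` and `step` themselves always return.

```python
def run_bounded(condition, step, state, max_iterations=1000):
    """Run `step` while `condition` holds, for at most max_iterations rounds.

    Returns (state, finished). `finished` is True only if `condition` was
    observed to be false before the iteration budget ran out.
    """
    for _ in range(max_iterations):
        if not condition(state):
            return state, True   # the loop ended on its own
        state = step(state)
    return state, False          # budget exhausted, gave up
```

The price is visible in the return value: a caller has to handle the case where the budget ran out before the work was done, which is part of why such code is less convenient to write.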

efee22 1 year ago | |

Infinite loops are not possible and would get rejected by the verifier since it cannot solve the halting problem. Here is a good overview on the options available: https://ebpf-docs.dylanreimerink.nl/linux/concepts/loops/

pkhuong 1 year ago | |

The basic logic flags any loop ("back-edge").
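
As an illustration of back-edge detection in general, here is a small depth-first search over a control-flow graph. It is a sketch of the textbook technique, not the kernel's `bpf_check()` code; the `has_back_edge` name and the dict-of-lists graph encoding are assumptions made for the example.

```python
def has_back_edge(cfg: dict, entry) -> bool:
    """Return True if the graph reachable from `entry` contains a cycle.

    `cfg` maps a basic-block id to the list of block ids it can jump to.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS path, finished
    color = {}

    def visit(block) -> bool:
        color[block] = GREY
        for succ in cfg.get(block, []):
            state = color.get(succ, WHITE)
            if state == GREY:
                return True  # edge to a block still on the path: a back-edge
            if state == WHITE and visit(succ):
                return True
        color[block] = BLACK
        return False

    return visit(entry)
```

A diamond-shaped graph such as `{0: [1, 2], 1: [3], 2: [3], 3: []}` is a DAG and passes, while `{0: [1], 1: [2], 2: [1, 3], 3: []}` is flagged because block 2 jumps back to block 1. The recursion is fine for a toy graph; a checker for large programs would use an explicit stack.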

rezonant 1 year ago | | |

This, others have said it less concisely, but a program without loops and arbitrary jumps is guaranteed to halt if we assume the external functions it calls into will halt.

umanwizard 1 year ago | |

eBPF is not Turing complete. Writing it is very annoying compared to writing normal C code for exactly this reason.

aksdlf 1 year ago | |

I'm glad to hear that Meta and Google code is "rigorous". I'd prefer INRIA, universities that fund theorem provers, industries where correctness matters like aerospace or semiconductors.

chc4 1 year ago | | |

Windows doesn't use the Linux eBPF verifier, they have their own implementation named PREVAIL[0] that is based on an abstract interpretation model that has formal small step semantics. The actual implementation isn't formally proven, however.

0: https://github.com/vbpf/ebpf-verifier

SoftTalker 1 year ago | | |

Also that lines of code is a proxy for rigor, something new I learned today. /s

atrus 1 year ago | |

The halting problem is exhaustive, there isn't an algorithm that is valid for all programs. You can still check for some kinds of infinite loops though!

roywiggins 1 year ago | | |

More specifically, you can accept a set of programs that you are certain do halt, and reject all others, at the expense of rejecting some that will halt. As long as that set is large enough to be practical, the result can be useful. If you eg forbid code paths that jump "backwards", you can't really loop at all. Or require loops to be bounded by constants.
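
A toy version of that trade-off, using a made-up four-instruction machine (`add`, `jz`, `jmp`, `exit`) that has nothing to do with the real eBPF instruction set: `accepts` allows only strictly forward jumps, and `run` is a tiny interpreter with a step budget so that rejected programs can still be tried out.

```python
def accepts(program) -> bool:
    """Accept only programs whose jumps all go strictly forward."""
    for pc, (op, arg) in enumerate(program):
        if op in ("jmp", "jz") and not (pc < arg <= len(program)):
            return False
    return True


def run(program, max_steps=10_000):
    """Interpret the toy program; return (accumulator, steps executed)."""
    acc, pc, steps = 0, 0, 0
    while pc < len(program):
        op, arg = program[pc]
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step budget exceeded")
        if op == "exit":
            break
        if op == "add":
            acc, pc = acc + arg, pc + 1
        elif op == "jmp":
            pc = arg
        elif op == "jz":
            pc = arg if acc == 0 else pc + 1
        else:
            raise ValueError(f"unknown op: {op}")
    return acc, steps


# Counts down from 3 with a backward jump: it halts, yet it is rejected.
COUNTDOWN = [("add", 3), ("add", -1), ("jz", 4), ("jmp", 1), ("exit", None)]
# Only forward jumps: accepted, and it runs at most len(program) steps.
STRAIGHT = [("add", 1), ("jz", 3), ("add", 1), ("exit", None)]
```

`COUNTDOWN` is one of the programs that halts but gets rejected anyway. Every accepted program is guaranteed to finish because the program counter can only move forward.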

dtx1 1 year ago | |

I have no insight into this particular project but you could work around the halting problem by only allowing loops you can prove will not go infinite. That would of course imply rejecting loops that won't go infinite but can't be proven not to.

skywhopper 1 year ago | |

If the verifier can't determine that the loop will halt, the program is disallowed. Also, if the program gets passed and then runs too long anyway, it's force-halted. So... I guess that solves the halting problem.

neaanopri 1 year ago | | |

It's more accurate to say that in principle, there could be programs that would halt, but that the verifier will deny.

lucianbr 1 year ago | | |

So this "solves" the halting problem by creating a new class "might-not-halt-but-not-sure" and lumping it with "does-not-halt". I find it hard to believe the new class is small enough for this to be useful, in the sense that it will avoid all kernel crashes.

I rather expect useful or needed code would be rejected due to "not-sure-it-halts", and then people will use some kind of exception or not use the verifier at all, and then we are back to square one.

red_admiral 1 year ago | |

eBPF is not Turing-complete, I suppose.

javierhonduco 1 year ago | | |

It is not; programs that are accepted are proved to terminate. Larger and more complex programs are accepted by BPF nowadays, which might give the impression that it's now Turing complete, when that is definitely not the case.

lizxrice 1 year ago | | |

In this talk we demo Conway's Game of Life implemented in eBPF: https://www.youtube.com/watch?v=tClsqnZMN6I

ahepp 1 year ago | |

If you’re wrong about the loop, you’ll still hit BPF_MAXINSNS, so it’s fine to use heuristics that could produce a false negative right?

hiddencost 1 year ago | |

Unterminated loops might be a better phrasing.

joker99 1 year ago |

I used to work for an EDR vendor and this post glosses over two major and important things.

1. There’s no need for eBPF on Windows: it has the ETW framework (event tracing), which is much more powerful and gives applications subscribing to a class of events almost too detailed insights. The issue most AV vendors have with it, though, is speed. Leading to …

2. eBPF lets you watch. Congrats. It’s something, but it’s not the reason why these tools are deployed. Orgs deploy these tools to prevent or stop potentially bad stuff from executing. The only place this can be done in our operating systems is usually the kernel - for that you need kernel level drivers or various other filter drivers.

Crowdstrike screwed the pooch here, yes. But after a couple of days I feel like I haven’t read enough blog posts and articles that crap on Microsoft. It’s their job to build a secure operating system; instead they deliver Windows, and because they themselves cannot secure Windows, they ship Defender… and we use tools like Falcon as a band-aid for Microsoft’s bad security practices

thayne 1 year ago | |

> eBPF lets you watch. Congrats. It’s something, but it’s not the reason why these tools are deployed. Orgs deploy these tools to prevent or stop potentially bad stuff from executing

eBPF lets you prevent things too. seccomp filters can block syscalls.

The bigger problem is the performance you mentioned in 1. Crowdstrike's linux agent can work using eBPF instead of a kernel module, and will fall back to that if the current kernel version is more recent than the agent supports. But... then it uses up a lot more CPU.

risenshinetech 1 year ago |

Thank God some superheroes have finally come along to make sure code never crashes any computers ever again! /s

chrisjj 1 year ago | |

Sometimes a point is such that only sarcasm can get it across sufficiently. This is such a point.