Meltdown Proof-of-Concept(github.com) |
by the way, that PoC was intense. Makes you wonder if the NSA knew about it all along :)
Colin Percival found a very similar issue with Intel's implementation of SMT on the Pentium 4 in 2005: http://www.daemonology.net/papers/htt.pdf
So the general idea of using timing attacks against the cache to leak memory has been known for at least that long.
In 2016, two researchers from the University of Graz gave a talk at the 33C3, where they showed that they had managed to use that technique to establish a covert channel between VMs running on the same physical host. They even managed to run ssh over that channel. https://media.ccc.de/v/33c3-8044-what_could_possibly_go_wron...
In light of that, I would be surprised if the NSA had not known about this.
Unlike "vanilla" cache-timing attacks:
* Meltdown and Spectre involve transient instructions, instructions that from the perspective of the ISA never actually run.
* Spectre v1 undermines the entire concept of a bounds check; pre-Spectre, virtually every program that runs on a computer is riddled with buffer overreads. It's about as big a revelation as Lopatic's HPUX stack overflow was in 1995. There might not be a clean fix! Load fences after every bounds check?
* Spectre v2 goes even further than that, and allows attackers to literally pick the locations target programs will execute from. Try to get your head around that: we pay tens of thousands of dollars for vulnerabilities that allow us to return to arbitrary program locations, and Spectre's branch target injection technique lets us use the hardware to, in some sense, do that to any program. And look at the fix to that: retpolines? Compilers can't directly emit indirect jumps anymore?
It's good that we're all recognizing how big a problem cache timing is. It was for sure not taken as seriously as it should have been outside of a subset of cryptographers. But Meltdown and Spectre are not simply cache timing vulnerabilities; they're a re-imagining of what you can do to a modern ISA by targeting the microarchitecture.
Call me a tinfoil hat conspiracist but the only rational explanation I can find for IBM POWER and z CPUs still being vulnerable to Spectre is the NSA forcing IBM not to fix it. I read somewhere that the z196 had three orders of magnitude more validation routines than the Intel Core at that time. It's extremely hard to believe they haven't caught this.
Former head of TAO Rob Joyce said "NSA did not know about the flaw, has not exploited it and certainly the U.S. government would never put a major company like Intel in a position of risk like this to try to hold open a vulnerability." [1]
Who knows if that's true or not, though. Certainly the U.S. government has done exactly that many times in the past (like with Heartbleed).
[1]: https://www.washingtonpost.com/business/technology/huge-secu...
The claim that "the U.S. government would never put a major company like Intel in a position of risk" is obviously bullshit. TAO's job necessarily involves exposing companies both in the US and overseas to that kind of risk on a daily basis.
> U.S. government would never put a major company like Intel in a position of risk like this to try to hold open a vulnerability." [1]
They subverted the Dual_EC_DRBG standardization process. Had they not been caught and the algorithm ended up on more devices they would be hurting not just major companies but whole industries.
Also for reference: https://en.wikipedia.org/wiki/Bullrun_(decryption_program)
Note that it talks about "the flaw", whereas Intel claims it isn't a "flaw". So could be another instance of overly specific denial. "We didn't exploit this flaw, because it isn't a flaw. We exploited the processor operating as designed".
</tinfoil>
The US government sure. The NSA? They sure would as this statement shows.
I believe one solution would be to put permission checks before the memory access, which would add serialized latency to all memory access. Another would be to have the speculative execution system flush cache lines that were loaded but ultimately ignored, which would be complex but probably not be as much of a speed hit.
(edit: yeah, a simple "flush" is insufficient, it would have to be closer to an isolated transaction with rollback of the access's effects on the cache system.)
I don't see why that would have to add latency to all (or any) memory access. The addresses generated by programs (except in real mode, when everything has access to everything anyway so we don't care about these issues then) are virtual addresses, so they have to be translated to get the actual memory address.
The permission information for a page is stored in the same place as the physical address translation information for that page. The processor fetches it at the same time it fetches the physical base address of the page.
They should also have the current permission level of the program readily available. That should be enough to let them do something about Meltdown without any performance impact. They could do something as simple as if the page is a supervisor page and the CPU is not in supervisor mode don't actually read the memory. Just substitute fixed data.
Note that AMD is reportedly not affected by Meltdown. From what I've read that is because they in fact do the protection check before trying to access the memory, even during speculation, and they don't suffer any performance loss from that.
Note that since Meltdown is only an issue when the kernel memory read is on the path that does NOT become the real path (because if it becomes the real path, then the program is going to get a fault anyway for an illegal memory access), the replacing of the memory accesses with fixed data cannot harm any legitimate program.
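To make the "just substitute fixed data" idea concrete, here is a toy model. It is only a sketch: the `ToyCPU` class, its `check_early` flag, and the addresses are all made up for illustration and do not describe how any real processor is built. It shows that if the permission check gates the data itself, the cache footprint left behind by the transient code no longer depends on the kernel byte.

```python
class ToyCPU:
    """Hypothetical model: memory maps address -> byte value, and the
    'cache' is the set of probe-array slots touched by transient code."""

    def __init__(self, memory, supervisor_addrs, check_early):
        self.memory = memory
        self.supervisor_addrs = supervisor_addrs
        self.check_early = check_early
        self.cached_slots = set()

    def speculative_load_and_probe(self, addr):
        """Model 'x = *addr; touch probe[x]' running in user mode.
        Returns True if the access would fault once it retires."""
        illegal = addr in self.supervisor_addrs
        if illegal and self.check_early:
            value = 0                    # fixed data, real byte is never read
        else:
            value = self.memory[addr]    # real byte forwarded to later uops
        self.cached_slots.add(value)     # this side effect survives rollback
        return illegal


memory = {0x1000: 0x11, 0xFFFF0000: 0x54}   # second entry plays "kernel" data
kernel = {0xFFFF0000}

leaky = ToyCPU(memory, kernel, check_early=False)
fixed = ToyCPU(memory, kernel, check_early=True)
leaky.speculative_load_and_probe(0xFFFF0000)
fixed.speculative_load_and_probe(0xFFFF0000)
print(leaky.cached_slots)   # {84}, i.e. the secret 0x54 is visible
print(fixed.cached_slots)   # {0}, only the substituted value is visible
```

A legal user-mode read (0x1000 here) behaves the same on both models, which is the point made above about legitimate programs not being harmed.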
Spectre is going to be the hard one for the CPU people to fix, I think. I think they may have to offer hardware memory protection features that can be used in user mode code to protect parts of that code from other parts of that code, so that things that want to run untrusted code in a sandbox in their own processes can do so in a separate address space that is protected similar to the way kernel space is protected from user space.
It may be more complicated than that, though, because Spectre also does some freaky things that take advantage of branch prediction information not being isolated between processors. I haven't read enough to understand the implications of that. I don't know if that can be defeated just by better memory protection enforcement.
I guess compilers could pad that out with no-ops to postpone the read until the previous commit is done if they know the design of the pipeline they are targeting. But generically optimized code would take a terrible hit from this.
* https://newsroom.intel.com/wp-content/uploads/sites/11/2018/... (https://news.ycombinator.com/item?id=16079910)
Edit: Probably the 'extreme circumstances' bit mentioned in https://news.ycombinator.com/item?id=16108434
> Project Member Comment 4 by hawkes@google.com, Aug 7
> Labels: Deadline-Grace
It looks like Ben Hawkes would know the reason why, but I think the speculation that this grace period was done due to the scope and severity of this finding is likely correct.
from: http://xlab.tencent.com/special/spectre/spectre_check.html
https://twitter.com/mlqxyz/status/950378419073712129
(I personally do not have a twitter account but was looking for the paper and stumbled upon it, glad I did!)
To test, set CONFIG_PAGE_TABLE_ISOLATION=y. That is:
sudo apt-get build-dep linux
sudo apt-get install gcc-6-plugin-dev libelf-dev libncurses5-dev
cd /usr/src
wget https://git.kernel.org/torvalds/t/linux-4.15-rc7.tar.gz
tar -xvf linux-4.15-rc7.tar.gz
cd linux-4.15-rc7
cp /boot/config-`uname -r` .config
make CONFIG_PAGE_TABLE_ISOLATION=y deb-pkg
Trying the kaslr program right now, it's not figuring out the direct map offset and it's probably already been a minute or two. So it works?
EDIT: After 40 minutes, it has attempted all addresses and did not find the direct map offset.
I think that the page isolation slows it down, even if it doesn't completely eliminate it.
The second test had something like a 0.05% success rate on my PC, and took over an hour to get a few dozen values read.
After trying this with the new kernel, I started up an AWS instance and ran the tests there. The first test (KASLR) succeeded within a few seconds, and the second test had a 100% success rate (read 1575 values in a few seconds).
This code is from TU Graz; I assume this is from Daniel Gruss's team, who participated in the original research.
I understood the "secret" data stays in the caches for a very short time until the branch prediction is rolled back, which makes this a timing attack, but I don't get how you actually read it.
EDIT
So perhaps someone can ELI5 me "4.2 Building a Covert Channel" [1] from the Meltdown paper which is what I didn't understand.
I thought the recent kernel-/firmware-/ucode-patches should have prevented that.
EDIT: The other demos fail, though, as they should. sigh
EDIT: For some reason, demo #2 (breaking kaslr) works on my Ryzen machine, but not on the others. :-?
>"reports this morning that Intel chief executive Brian Krzanich made $25 million from selling Intel stock in late November, when he knew about the bugs, but before they were made public" (https://qz.com/1171391/the-intel-intc-meltdown-bug-is-hittin...)
I assume he's supposed to now be prosecuted, that sounds like insider dealing? [I'd like to say "will be prosecuted" but ...]
First, the "Direct physical map offset" comes back wrong in Demo #2. Second, if I use the correct offset, the reliability is around 0.5% in Demo #3 - but not consistently... after a few tries it did come back with >99%
Basically, screw up your caches continuously.
Does it mean a hacked iOS/Android app can also (in theory) sniff a password entered in a system dialog, as demoed in the video?
Realtime password input - https://www.youtube.com/watch?v=yTpXqyRYcBM
I've only seen two implementations: one based on just doing the access to kernel memory, catching the SIGSEGV, and then probing the cache. Obviously that could be closed by the kernel flushing the cache prior to handing control back to user space after SIGSEGV. Doing that would have no impact on normal programs.
The second is by exploiting a bug in Intel's transactional memory implementation. But I assume Intel could turn that feature off as they have done so in the past. Since bugger all programs use it doing so wouldn't have much impact.
Which means the approach being taken now is done purely to kill the speculative branch method (i.e., Spectre pointed at the kernel). The authors say it should work, but also say they could not make it work. I haven't been able to find any working PoC for my Linux machines.
So my question is: is there any out there?
#2 - physical memory leak - https://www.youtube.com/watch?v=kn0FopiF16o
the videos aren't very long, someone should compress it to <10mb as an animated gif and do a pull request to put it in the README
There's no need to use an awful format like gif, just embed an efficiently compressed video file with the <video> tag
You can force values from any memory to affect the cache in a predictable manner which enables you to read all physical memory. See https://news.ycombinator.com/item?id=16108574 or read the paper yourself https://meltdownattack.com/meltdown.pdf
We believe that this precondition is that the targeted kernel memory is present in the L1D cache.
Not only is L1D tiny, but stuff like prefetch doesn't touch it. So how exactly do you force any memory into L1D cache unless, like in all the examples we have seen, the victim program is pretty much accessing it in a busy loop?
Is there a direct method for that or do you mean that you can repeatedly try reading memory addresses until the address that you want to access is actually in the cache prior to your access?
if(false_but_predictive_execution_cant_tell)
{
int i = (int)*protected_kernel_memory_byte;
load_page(i, my_pages);
}
Then it becomes a matter of checking the speed of reading from those pages. Whichever one reads too fast to have been loaded from memory must be the value read from protected memory.
My understanding is that the problem is that the data in the cache _isn't_ rolled back.
You fetch the secret data. You then fetch a different memory address based on the contents of the secret data e.g. fetch((secret_bit * 128) + offset) [1] so if secret_bit is 0 it's fetched the memory at offset into the cache, if secret_bit is 1 it's fetched the memory at offset+128 into the cache.
After the speculative work is rolled back, the data that it fetched into the cache still remains. You then time how long it takes to fetch offset and offset+128. If offset comes back quickly, secret_bit was 0. If offset+128 comes back quickly, secret_bit was 1.
_That_ is where the timing attack part comes in: "timing attack" refers to using measurements of how long something took to glean information, not that you need to do it quickly.
[1] In reality you do it on the byte level and use &, but I wanted to keep it to guessing a single bit to make it simpler.
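The single-bit version described above can be played through with a simulated cache. This is a sketch, not an exploit: `ToyCache`, the `HIT`/`MISS` latencies, and the default `offset` are invented for illustration, and the "transient" access is just an ordinary call here. It only shows the flush, touch, then time sequence.

```python
HIT, MISS = 40, 300   # made-up latencies in "cycles"


class ToyCache:
    """Hypothetical cache model: a set of cached addresses."""

    def __init__(self):
        self.lines = set()

    def flush(self, addr):
        self.lines.discard(addr)

    def access(self, addr):
        """Return a simulated latency and leave addr cached."""
        latency = HIT if addr in self.lines else MISS
        self.lines.add(addr)
        return latency


def leak_bit(cache, secret_bit, offset=0x1000):
    # 1. Make sure neither candidate address is cached.
    cache.flush(offset)
    cache.flush(offset + 128)
    # 2. Stand-in for the transient code: fetch((secret_bit * 128) + offset).
    cache.access(secret_bit * 128 + offset)
    # 3. After the "rollback", time both candidates.
    t0 = cache.access(offset)
    t1 = cache.access(offset + 128)
    # The faster one tells us the bit.
    return 0 if t0 < t1 else 1


print([leak_bit(ToyCache(), b) for b in (0, 1)])   # [0, 1]
```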
I was under the impression that there is no interface to read data from the CPU caches and that the cache is managed by the CPU itself only.
The covert channel consists of a "sender" and a "receiver". The receiver can't extract contents of L1 cache, but it can detect which pages were in cache by timing differences. So the sender encodes the secret data by fetching particular addresses calculated so that the receiver can afterwards recover the secret by verifying which page(s) were in cache.
In Meltdown attack, the sender consists of instructions controlled by you - e.g. x=memory_you_shouldn't_access; y=array[1000*x] - and after an Intel processor notices that you shouldn't access that memory and rolls back the instructions (invalidating y and x), the 1000*x location was already pulled to cache, and you can check - is array[1000] cached? is array[2000] cached? is array[142000] cached? to determine x.
In Spectre attack, the sender consists of code in the vulnerable application that happens to contain similar instructions. Spectre attack means that if your application anywhere contains code like if (parameter is in bounds) {x=array1[parameter]} (...possibly some other code...) y=array2[x], then any attacker that (a) runs on the same CPU and (b) can manipulate the parameter somehow can trick this code to process the path "protected" by 'if' and reveal random memory out of bounds of that array1. The difference from ordinary buffer overflow bugs is that code like that is normal, common and (in general) not a bug, since the instructions "don't get executed", and the vulnerability persists even if you validate all input.
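Here is a toy rendering of that Spectre pattern. Everything in it is an assumption made for illustration: `flat_memory` stands in for the victim's address space (a 4-byte array1 followed directly by a secret), `mispredict` stands in for the attacker having trained the branch predictor, and `STRIDE` borrows the 1000-element spacing from the Meltdown example above. No real speculation happens in Python; the sketch only shows why a correct bounds check still leaves a readable cache footprint.

```python
STRIDE = 1000
ARRAY1_LEN = 4
# Hypothetical victim memory: array1, then unrelated secret bytes after it.
flat_memory = bytes([1, 2, 3, 4]) + b"S3cret"


def victim(parameter, cached, mispredict):
    """Model of: if (parameter in bounds) { x = array1[parameter]; y = array2[x] }.
    `cached` collects the array2 indexes that end up in the cache."""
    in_bounds = 0 <= parameter < ARRAY1_LEN
    if in_bounds or mispredict:           # mispredicted branch runs transiently
        x = flat_memory[parameter]        # may read past the end of array1
        cached.add(x * STRIDE)            # array2[x * STRIDE] gets cached
    return in_bounds                      # architecturally nothing else changes


def recover_byte(parameter):
    cached = set()
    victim(parameter, cached, mispredict=True)
    # Receiver side: find which array2 slot was pulled into the cache.
    for guess in range(256):
        if guess * STRIDE in cached:
            return guess
    return None


leaked = bytes(recover_byte(ARRAY1_LEN + i) for i in range(6))
print(leaked)   # b'S3cret'
```

Without the misprediction, an out-of-bounds `parameter` leaves `cached` empty, which is why the code looks (and architecturally is) bug-free.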
as in, demo #2 is a working exploit to get this map
They don't exactly leave behind a lot of telltale signs.
This is also the kind of bug that is so broad (read access to everything on almost any machine you can execute code on) that a large subset of those equipped to discover it would have kept their mouths shut.
> So you basically start with a broken system to exploit these bugs.
A lot of systems were broken in the time before KASLR came along
For a start - this is hardly a remote possibility when we already have proof of concepts like the linked repo.
Secondly - your analogy makes no sense. The only way to make it make sense is to add that we also know there is an entire spacefaring group of mercenaries whose entire hobby and/or job is deliberately throwing asteroids in Earth's general direction.
Maybe there is, but they are hilariously incompetent?
Note they do say
> This demo uses Meltdown to leak the (secret) randomization of the direct physical map. This demo requires root privileges to speed up the process. The paper describes a variant which does not require root privileges.
but I don't know how much allowing it to sudo speeds up the process.
You probably know this (saw you're the person I replied to initially), but for others reading this to check that it's on, "dmesg | grep isolation" should be able to tell you whether the page table isolation is on after you enable it in the kernel.
Given the other tests require the offset, I think I'm safe? I'm going to run it again just to be sure.
That's not to say that removing SharedArrayBuffer (and high-precision performance timers, which were removed a couple years back to mitigate some other timing-related vulnerabilities) is enough to completely eliminate Spectre; there might be other methods that can time accurately enough to reveal information.
(I might be completely wrong here, but this is my current understanding of the situation, at least.)
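A rough illustration of why coarsening the timer helps against a single measurement. The latencies and the 100 microsecond resolution are made-up numbers, and `coarse_time` is a hypothetical stand-in for a browser's reduced-precision clock, not any real API.

```python
def coarse_time(true_ns, resolution_ns):
    """Round a timestamp down to the timer's resolution."""
    return (true_ns // resolution_ns) * resolution_ns


def measured(start_ns, duration_ns, resolution_ns):
    """What the program sees when it times an operation with a coarse clock."""
    end = coarse_time(start_ns + duration_ns, resolution_ns)
    return end - coarse_time(start_ns, resolution_ns)


HIT_NS, MISS_NS = 40, 300   # invented cache hit / miss latencies

# With a 1 ns clock the two cases are easy to tell apart.
print(measured(0, HIT_NS, 1), measured(0, MISS_NS, 1))              # 40 300
# With a 100 us clock both measurements collapse to the same value.
print(measured(0, HIT_NS, 100_000), measured(0, MISS_NS, 100_000))  # 0 0
```

This only covers one measurement taken with the clock API itself; repeating the operation many times, or building a counter out of something like a shared-memory worker loop, is the kind of "other method" the comment above is worried about.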
0: https://en.wikipedia.org/wiki/Kernel_page-table_isolation
Edit: See the archive[0] apparently I'm not going mad and it used to say that the patch was applied to Sierra and El Capitan, but Apple has since changed that.
0: https://web.archive.org/web/20180105102220/https://support.a...
So you read any address you want speculatively and then use the result to prime the cache in such a way that you can determine what the value you read speculatively was. This works because modern operating systems map kernel space addresses into normal processes to make syscalls faster.
I'd recommend reading the paper[0], it's fascinating stuff.
https://www.cs.tau.ac.il/~tromer/papers/cache.pdf
IIRC, the only way to address the issue was the addition of the AES-NI instruction set, which came a few years later.
Another option would be to use a bitsliced implementation of AES, at some cost in speed. I could also imagine an implementation which read the whole table every time, using constant-time operations to select the desired element(s), but I don't know how slow that would be.
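The "read the whole table every time" idea looks roughly like this. It is a sketch of the access pattern only: `ct_lookup` is a made-up name, it assumes a table of at most 256 byte-sized entries, and Python itself gives no constant-time guarantees, so a real implementation would be written in C or assembly.

```python
def ct_lookup(table, index):
    """Touch every entry and pick the wanted one with a mask, so which
    memory gets read does not depend on the (secret) index."""
    result = 0
    for i, entry in enumerate(table):
        diff = i ^ index
        # mask is 0xFF when i == index and 0x00 otherwise (i, index < 256)
        mask = ((diff - 1) >> 8) & 0xFF
        result |= entry & mask
    return result


table = [(7 * i + 3) % 256 for i in range(256)]   # arbitrary byte table
print(ct_lookup(table, 84) == table[84])          # True
```

The cost is 256 reads per lookup instead of one, which is the speed question raised above.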
L1 caches are generally virtually indexed for exactly this reason: to allow a L1 cache read to happen in parallel with the TLB lookup. (They're also usually, I believe, physically tagged, so we have to check for collisions at some point, but making sure there's no side channel information at that point is, obviously given recent events, hard.)
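A quick worked example of why the lookup can start before translation finishes. The cache geometry (32 KiB, 8-way, 64-byte lines) and 4 KiB pages are assumptions picked because they are common, and the two addresses are invented; the point is only that the set index comes from bits inside the page offset, which translation does not change.

```python
CACHE_BYTES = 32 * 1024   # assumed L1D size
WAYS = 8                  # assumed associativity
LINE_BYTES = 64
PAGE_BYTES = 4096

sets = CACHE_BYTES // (WAYS * LINE_BYTES)


def set_index(addr):
    """Set index = the address bits just above the line offset."""
    return (addr // LINE_BYTES) % sets


# Hypothetical virtual address and its physical translation: the page
# number differs, the low 12 bits (page offset) are identical.
virtual = 0x7F12_3456_7ABC
physical = 0x0000_0009_8ABC

print(sets)                                       # 64
print(sets * LINE_BYTES <= PAGE_BYTES)            # True: index fits in page offset
print(set_index(virtual), set_index(physical))    # 42 42
```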
Spectre is, as you say, harder - but more because the line of what sort of state should be separate isn't as clear-cut as we might like it to be (i.e. it's not necessarily just "processes" as the OS sees it - e.g. JVM/JavaScript interpreter state should allow for an effective sandbox between the executing interpreter/JVM process and what the running JVM/JavaScript code can see). And worse, those are precisely the cases where one probably cares most about separation given that's where untrusted code is typically run.
But hardware assistance could help - in simple terms, I'd imagine that allowing a swap out of more of the internal processor state (to the extent that one process "training" the branch-predictor doesn't impact how the branch predictor acts in another process) would be pretty effective. That might be expensive in terms of performance per-transistor/per-watt however (though probably not absolute performance).
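One hypothetical way to picture that kind of isolation is a branch target buffer whose entries are keyed by process as well as by branch address. `ToyBTB` and the addresses below are invented for this sketch; real predictors are indexed and tagged very differently, and tagging is just one stand-in for "swapping out" predictor state.

```python
class ToyBTB:
    """Toy branch target buffer, optionally tagged with a process id."""

    def __init__(self, tag_with_process):
        self.tag_with_process = tag_with_process
        self.targets = {}

    def _key(self, pid, branch_addr):
        return (pid, branch_addr) if self.tag_with_process else branch_addr

    def train(self, pid, branch_addr, target):
        self.targets[self._key(pid, branch_addr)] = target

    def predict(self, pid, branch_addr):
        return self.targets.get(self._key(pid, branch_addr))


ATTACKER, VICTIM = 1, 2
shared = ToyBTB(tag_with_process=False)
tagged = ToyBTB(tag_with_process=True)
for btb in (shared, tagged):
    btb.train(ATTACKER, 0x400, 0xBAD)      # attacker trains the branch at 0x400

print(shared.predict(VICTIM, 0x400))   # 2989 (0xBAD): victim inherits the training
print(tagged.predict(VICTIM, 0x400))   # None: the attacker's training is invisible
```

The extra tag bits (or the save/restore traffic) are where the per-transistor and per-watt cost mentioned above would come from.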
...sorry, what?
It makes you wonder if the NSA had chip makers incorporate speculative execution and caching because... timing attacks?
It's just that it's highly suspicious that anyone is making any type of mention of it at all.
Here's the example from the paper.
1: ; rcx = kernel address
2: ; rbx = probe array
3: retry:
4: mov al, byte [rcx] ; Read kernel memory(1 byte) into AL which is the least significant byte of RAX
5: shl rax, 0xc ; Multiply the collected value by 4096
6: jz retry ; Retry in case of zero to reduce noise
7: mov rbx, qword [rbx + rax] ; Access memory based on the value read on line 4
8: ; Note: The read on line 4 is illegal, but the CPU speculatively executes line 5-7 before this triggers a fault.
The receiving code then tries to access each of these 256 memory locations and measures the time taken. For one of them the time will be much lower since that memory is cached, and thus that location gives the value read. So if you read the value 84 on line 4, then accessing offset 344064 dec (0x54000) in your probe array will be faster and you can deduce the read value was 84.
So in pseudo code the attack is
start = 0xFFEE // No idea if this is a reasonable start location
result = []
offset = 0
page_size = 4096
probe_array = malloc(256 * page_size)
loop {
flush_caches(probe_array)
read_and_prepare_cache(start + offset * 8, probe_array) // The above assembly
result.push(determine_read_value(probe_array))
offset += 1
}
There's an extra detail here about recovering from the illegal memory access in a quick way that I've skipped.
To answer the parent's question, I believe this only uses a single cache line (64 bytes) since it only accesses a single value.
This is my understanding anyway, happy to be corrected
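The address arithmetic and the decode step are easy to check in isolation. In this sketch `probe_offset` mirrors the shift in the assembly, and `determine_read_value` is a guess at what the pseudo code's helper of the same name would do; the timings are invented rather than measured.

```python
PAGE_SIZE = 4096


def probe_offset(value):
    """Offset into the probe array for a leaked byte: the 'shl rax, 0xc' step."""
    return value << 12


def determine_read_value(timings):
    """timings[i] is the time taken to read probe_array[i * PAGE_SIZE].
    The cached (fastest) page index is the leaked byte."""
    return min(range(256), key=lambda i: timings[i])


print(probe_offset(84), hex(probe_offset(84)))   # 344064 0x54000

timings = [300] * 256   # made-up "uncached" latency for every page
timings[84] = 40        # the one page touched by the transient load
print(determine_read_value(timings))             # 84
```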
Animated gifs do work when embedded, but need to be <= 10mb [2]
[1] https://stackoverflow.com/questions/4279611/how-to-embed-a-v...
Yes, you can! :)
I get your point. From the perspective of somebody who normally does not deal with such low-level affairs, the difference to prior cache timing attacks is not /that/ obvious. It all looks like black magic to me, even after I roughly understand how it works.
But meanwhile, yes, can definitely agree how fucking cool it all is
A second instruction in the pipeline would read from the above mentioned L0 cache (let us call it load buffer), much like it would for tentative memory stores from the store buffer.
Also, two memory fetches in parallel are not twice as long as a memory fetch, if that would be the solution (which I guess would not be the case, as I imagine race conditions appearing)
For example say the memory address you want to look for being cached is either 0x100 or 0x200 (not realistic addresses but it works for example) based on some kernel memory bit. Then run instructions in userspace that try to fetch 0x100 (with flushes in between). If you notice one that completes quickly, then it must have used the value 0x100 cached in L0 cache by the kernel? (and also run over 0x200 to try and check when it's cached in L0)
So if the uOP that populated the L0 was reading from kernel memory, then it won't be committed. And a subsequent uOP that reads from the L0 won't be committed either. So you can't get timing information from them.
"Sir, I have a cunning plan" "Does it involve that legion of rabid space weasels again?" "... maybe."
If there are bugs that can be exposed through various machine code patterns, the compiler can centralize the restrictions of what may be executed, enforce runtime checks, or prevent certain instructions from being used at all. Security or optimization updates would affect all running programs automatically. Granted, these current speculative vulnerabilities would be much more difficult to statically detect.
But it would follow the crazy gentoo dream of having everything optimized for your environment better, allow much better compatibility across systems, and prevent entire classes of privilege escalation issues.
Having said that, AMD CPUs are the existence proof that you can be immune to meltdown with no significant overhead.
Spectre is a completely different issue though.
Perhaps that's because they haven't taken the speed short-cuts that Intel took...?
* https://news.ycombinator.com/item?id=16086047
* https://news.ycombinator.com/item?id=16074531
* https://news.ycombinator.com/item?id=16075744
Moreover, the OpenBSD people have made some remarks about how it was commentaries in Linux patches and discussions on LWN that actually let the cat out of the bag this time.
* http://pythonsweetness.tumblr.com/post/169166980422/the-myst... (https://news.ycombinator.com/item?id=16046636)
Last year I was already hoping that ARM Chromebooks would become more popular but in reality you cannot find them in any retail store.
I believe it's high time the long history of anticompetitive actions by Intel end, and their near/effective monopoly in major market segments be regulated.
Just something to keep in mind, not something we need to litigate on this thread.
Project Zero evaluated and relaxed their disclosure policy after that incident as described here https://googleprojectzero.blogspot.com/2015/02/feedback-and-...
I see there's some extensions there (maximum of 14 days) but this bug would have probably been covered under "As always, we reserve the right to bring deadlines forwards or backwards based on extreme circumstances."
Thanks!
But yeah, protecting against it means implementing memory protection in more places in the CPU. More gates and the possibility of becoming a bottleneck.
Is that actually being done? The FreeBSD team got notified (late), the DragonFlyBSD, OpenBSD, NetBSD teams did not get notified. Matt, of course, seems to have a patch already.
So... basically re-inventing Java? :)
"Raw machine code bytes" aren't distributed but occur through the privileged JVM and its just-in-time compiler, the byte-code verifier enforces restrictions on what data-access patterns and where instructions can be used, the JVM for a particular OS has optimizations for that environment, and sandboxing (while imperfect) blocks some classes of privilege escalation issues.
Don't get me wrong, I'm not saying Java is perfect or that the underlying goal isn't good, I'm just happily amused by this sense of "everything old is new again."
Of course, Java certainly does have some higher level weaknesses as in the introspection API kerfuffle a while back, and is too locked into its Object Obsessed design for it to be a truly general purpose object code format.
A privileged process (the microcode) enforces restrictions and converts it to micro-ops which execute on the real processor.
I just don't want to see performance being decimated as a trade off for security, if at all possible.