X86 MMU fault handling is turing complete(github.com) |
X86 MMU fault handling is turing complete(github.com) |
What's happening here is that they're getting computation without executing any instructions, simply through the process of using the MMU hardware to "resolve addresses". The page directory system has been set up in such a way that address resolution effects a virtual machine that they can code to.
This works because when you attempt to resolve an invalid address, the CPU generates a trap (#PF), and the handling of that trap pushes information on the "stack". Each time you push data to the stack, you decrement the stack pointer. Eventually, the stack pointer underflows; when that happens, a different trap (#DF) fires. This mechanism put together gives you:
if x < 4 { goto b } else { x = x - 4 ; goto a }
also known as "subtract and branch if less than or equal to zero", also known as "an instruction adequate to construct a one-instruction computer".The virtual machine "runs" by generating an unending series of traps: in the "goto a" case, the result of translation is another address generating a trap. And so on.
The details of how this computer has "memory" and addresses instructions is even headachier. They're using the x86 TSS as "memory" and for technical reasons they get 16 slots (and thus instructions) to work with, but they have a compiler that builds arbitrary programs into 16-colored graphs to use those slots to express generic programs. Every emulator they could find crashes when they abuse the hardware task switching system this way.
Here's it running Conway's Life:
http://youtubedoubler.com/?video1=E2VCwBzGdPM&start1=0...
Here's their talk for a few months back:
http://www.youtube.com/watch?v=NGXvJ1GKBKM
The talk is great, but if you're not super interested in X86/X64 memory corruption countermeasures, you might want to skip the first 30 minutes.
Probably by patching the emulator.
By running it on a physical machine. Unless there is a requirement that the processor not be multitasking that I am missing.
Actually is it even true that they are not executing any x86 instructions on the CPU? From my understanding, the handling of page faults needs the CPU to execute some instructions. Maybe I'm wrong, if so could you enlighten me? Thanks.
I suppose if you setup everything very carefully you can make it fault over and over again without giving it the time to execute any instruction.
Without looking into the specifics, I think it's very possible that the CPU is not actually executing any instructions, just waiting for the MMU to get a hold of itself. After all, in order to simply load the instructions you need the MMU to be responsive (or deactivated I suppose, if there's such a thing as no-MMU x86).
The easiest way however would be to use the TrapCC mechanism to transfer control between bits of normal assembler code (perhaps repurposed from other functions already in your kernel), doing something similar to ROP. Of course, for additional fun, feel free to throw in BX's Brainfuck interpreter in ELF and James Oakley's DWARF exception handler. We might drop a demo of this soon, i.e. implementing a self-decrypting binary via page faults.
I'm wondering what method PFLA uses to read/write non-code addresses. Only one address per page can be addressed? I'll take a look at the compiler.
By simply expanding the addressing capability, a very tiny program could emulate an instruction stream from memory, overcoming the limited code space (at the cost of execution speed).
Cheers!
This is basically the canonical instruction for OISCs (one instruction set computers). Wikipedia describes it pretty well: https://en.wikipedia.org/wiki/One_instruction_set_computer#S....
https://events.ccc.de/congress/2012/Fahrplan/events/5265.en....
video:
Would that slow things down incredibly?
Is there a way the virtual machine might spawn another virtual machine child of its own?
http://www.cs.dartmouth.edu/~bx/elf-bf-tools/slides/ELF-berl...
386 has both a segment mechanism and a paging mechanism. The segment mechanism has several luxury features like automatic saving and restoring of task context. It is possible to set up a segment descriptor so that the CPU, when jumping (or faulting) to an address in the segment, will automatically save task state at one address (taken from a register) and restore task state from another address (taken from the descriptor). Ibelieve that's what they use here. Hence, free memory accesses.
Edit: browsed around the code some more and it seems the ascii visualization stuff is done in regular c/asm polling the "virtual game of life" mem(?)
The spec says "don't put a tss cross page boundary". This might also have been the case in some x86 implementations, but surely they fixed that once that hardware virtualization was implemented.
They actually depend on the sane behavior: they trigger the double fault when the cpu spills the tss across a page boundary.
They mention the manual because according to the manual their idea wouldn't work, but it does.
Then they also say "We should test it", and the result of the test is "CPU translates DWORD by DWORD" (i.e. doesn't miss a byte^H^H^H^Hword) And then a nice kitty to show how relieved they were from the good news! No dragons in the way.
--
(Unless I got all wrong, but this topic is not the simplest one to reverse engineer from some slides, did anybody find a presentation video? It would really help to hear somebody commenting those slides, otherwise is a nested puzzle. And where is the patched qemu so one can play with that?)
Unless the encryption key is guarded by something with SMM privileges -- has that been done?