X86 MMU fault handling is turing complete

X86 MMU fault handling is turing complete (github.com)

417 points by mman 13 years ago | 39 comments

tptacek 13 years ago |

This is more or less the greatest thing I've learned about in the last couple years.

What's happening here is that they're getting computation without executing any instructions, simply through the process of using the MMU hardware to "resolve addresses". The page directory system has been set up in such a way that address resolution effects a virtual machine that they can code to.

This works because when you attempt to resolve an invalid address, the CPU generates a trap (#PF), and the handling of that trap pushes information on the "stack". Each time you push data to the stack, you decrement the stack pointer. Eventually, the stack pointer underflows; when that happens, a different trap (#DF) fires. This mechanism put together gives you:

    if x < 4 { goto b } else { x = x - 4 ; goto a }

also known as "subtract and branch if less than or equal to zero", also known as "an instruction adequate to construct a one-instruction computer".

The virtual machine "runs" by generating an unending series of traps: in the "goto a" case, the result of translation is another address generating a trap. And so on.
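As a rough illustration of the primitive above (a sketch only, not the trapcc code): the stack pointer is modelled as a plain integer, and the name `run_trap_loop` and the `step` parameter are made up for this example.

```python
def run_trap_loop(x, step=4):
    """Toy model of the fault-driven primitive described in the comment.

    x stands in for the stack pointer. Each simulated page fault pushes
    `step` bytes, so x drops by `step` ("goto a"). Once fewer than `step`
    bytes remain, the other branch is taken ("goto b", the double fault).
    """
    faults = 0
    while x >= step:   # else-branch: x = x - 4; goto a
        x -= step
        faults += 1
    return faults, x   # if-branch: x < 4; goto b


faults, remaining = run_trap_loop(10)
print(faults, remaining)  # prints "2 2"
```

Starting from x = 10, the model takes the fault branch twice and is left with x = 2 when it falls through to the other branch.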

The details of how this computer has "memory" and addresses instructions are even headachier. They're using the x86 TSS as "memory" and for technical reasons they get 16 slots (and thus instructions) to work with, but they have a compiler that builds arbitrary programs into 16-colored graphs to use those slots to express generic programs. Every emulator they could find crashes when they abuse the hardware task switching system this way.

Here's it running Conway's Life:

http://youtubedoubler.com/?video1=E2VCwBzGdPM&start1=0&#...

Here's their talk from a few months back:

http://www.youtube.com/watch?v=NGXvJ1GKBKM

The talk is great, but if you're not super interested in X86/X64 memory corruption countermeasures, you might want to skip the first 30 minutes.

dbarlett 13 years ago | |

Reminds me of "ICMP delay-line memory": http://stackoverflow.com/questions/12748246/sorting-1-millio...

0x0 13 years ago | |

The slides in the github repo ( https://github.com/jbangert/trapcc/blob/master/slides/PFLA-s... ) also have a few interesting points, like "No publicly available simulator implements this correctly" (how did they record the youtube video?) and a few vague hints about exploiting this for doing VM escapes.

jey 13 years ago | | |

> "No publicly available simulator implements this correctly" (how did they record the youtube video?)

Probably by patching the emulator.

calt 13 years ago | | |

> how did they record the youtube video?

By running it on a physical machine. Unless there is a requirement that the processor not be multitasking that I am missing.

tptacek 13 years ago | | |

By fixing bugs in Bochs?

jd007 13 years ago | |

Isn't it a bit misleading to say that they are getting computation without executing any instructions? It's true that they are not executing any x86 instructions from the CPU, but the MMU is doing all the work by executing its instructions of address resolution.

Actually is it even true that they are not executing any x86 instructions on the CPU? From my understanding, the handling of page faults needs the CPU to execute some instructions. Maybe I'm wrong, if so could you enlighten me? Thanks.

simias 13 years ago | | |

I'm not very familiar with the x86 architecture, but usually when there's a fault the CPU attempts to look up the address of a "callback" function in an interrupt/fault vector.

I suppose if you set up everything very carefully you can make it fault over and over again without giving it the time to execute any instruction.

Without looking into the specifics, I think it's very possible that the CPU is not actually executing any instructions, just waiting for the MMU to get a hold of itself. After all, in order to simply load the instructions you need the MMU to be responsive (or deactivated I suppose, if there's such a thing as no-MMU x86).

jbangert 13 years ago |

Author here: While it is true that with the current implementation memory access is extremely limited (essentially one DWORD per page, or about 0.1% of the available physical RAM), that limitation can certainly be avoided. For one, you could shift how the TSS is aligned (and align them differently for different instructions), multiplying your address space by a factor of 10 or so. Furthermore, you could also place another TSS somewhere in memory (only a few of the variables need to actually contain sane values) with an invalid EIP and use that as a 'load' instruction.
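For reference, the "about 0.1%" figure can be checked with a couple of lines, assuming standard 4 KiB pages and a 4-byte DWORD (the page size is an assumption here, not stated in the comment):

```python
PAGE_SIZE = 4096   # bytes; assumes standard 4 KiB x86 pages
DWORD_SIZE = 4     # bytes

# Fraction of each page that is usable when only one DWORD per page counts.
usable_fraction = DWORD_SIZE / PAGE_SIZE
print(f"{usable_fraction:.4%}")  # prints "0.0977%"
```

A tenfold gain from realigning the TSS would put that at roughly 1%.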

The easiest way however would be to use the TrapCC mechanism to transfer control between bits of normal assembler code (perhaps repurposed from other functions already in your kernel), doing something similar to ROP. Of course, for additional fun, feel free to throw in BX's Brainfuck interpreter in ELF and James Oakley's DWARF exception handler. We might drop a demo of this soon, i.e. implementing a self-decrypting binary via page faults.

sounds 13 years ago | |

"memory access is extremely limited (essentially one DWORD per page" – referring to non-code addresses, yes? In the current (simplest) implementation, each instruction (a TSS) must be aligned across a page boundary. You do comment below that altering alignment could increase the available code space.

I'm wondering what method PFLA uses to read/write non-code addresses. Only one address per page can be addressed? I'll take a look at the compiler.

By simply expanding the addressing capability, a very tiny program could emulate an instruction stream from memory, overcoming the limited code space (at the cost of execution speed).

Cheers!

networked 13 years ago |

>Move, Branch if Zero, Decrement

This is basically the canonical instruction for OISCs (one instruction set computers). Wikipedia describes it pretty well: https://en.wikipedia.org/wiki/One_instruction_set_computer#S....
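For anyone who hasn't met an OISC before, here is a minimal sketch of the widely cited "subtract and branch if less than or equal to zero" (subleq) variant. The quoted "Move, Branch if Zero, Decrement" instruction differs in detail, and the truncated link above may or may not point at this exact variant. The memory layout, the negative-address halt convention, and the name `run_subleq` are assumptions for illustration.

```python
def run_subleq(mem, pc=0, max_steps=10_000):
    """Run a subleq program stored in the list `mem`.

    Each instruction is three cells (a, b, c): mem[b] -= mem[a], then
    jump to c if the result is <= 0, otherwise fall through to the next
    instruction. A negative program counter halts (assumed convention).
    """
    steps = 0
    while 0 <= pc and pc + 2 < len(mem) and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem


# Program: B += A, using Z as scratch. Data lives at A=9, B=10, Z=11.
program = [
    9, 11, 3,    # Z -= A   (Z becomes -A)
    11, 10, 6,   # B -= Z   (B becomes B + A)
    11, 11, -1,  # Z -= Z   (Z back to 0), then halt
    7, 5, 0,     # A = 7, B = 5, Z = 0
]
result = run_subleq(program)
print(result[10])  # prints "12"
```

Three subleq instructions are enough for an addition, which is the sense in which a single instruction can stand in for a whole instruction set.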

majke 13 years ago |

There was a talk on 29c3 about this. Abstract:

https://events.ccc.de/congress/2012/Fahrplan/events/5265.en....

video:

https://www.youtube.com/watch?v=NGXvJ1GKBKM

codex 13 years ago |

Another place for root kits to hide.

tomrod 13 years ago | |

Could one write a preemptive rootkit that sniffs for other rootkits?

Would that slow things down incredibly?

simias 13 years ago | | |

That's called an antivirus :)

iamrohitbanga 13 years ago | |

can you please elaborate?

jpollock 13 years ago | | |

It's computation that you can't see with a debugger, or with any sort of tracing.

ars 13 years ago |

How fast (slow) is this relative to the host CPU?

simias 13 years ago | |

Probably incredibly slow given the reduced instruction set and that it relies on context switches/pushing stuff on the stack for functioning.

rocky1138 13 years ago |

This is really interesting. In a way, it's a form of computer self-replication. Could the virtual machine created by the computer be considered offspring?

Is there a way the virtual machine might spawn another virtual machine child of its own?

ithkuil 13 years ago |

If you like this kind of thing, there is also:

http://www.cs.dartmouth.edu/~bx/elf-bf-tools/slides/ELF-berl...

traxtech 13 years ago |

That's the hardware version of the brainfuck philosophy.

switch33 13 years ago | |

Best explanation ever. I second this. lol

conductor 13 years ago |

Expect this technique in future malware and software-protection/DRM systems, to make code analysis harder.

general_failure 13 years ago |

somebody checked in vim backup files :-)