I am not sure what QEMU's JIT is doing (in its userspace wrapper), but I think it has a lot of room to improve.
In 2013 I wrote an x86-64 to aarch64 JIT engine that could run the binaries of what was then the Fedora aarch64 beta and rebuild almost the entire aarch64 port of Fedora on an x86_64 Linux host. I also made a reverse aarch64 to x86-64 JIT that worked the same way, and for fun I showed the two JITs running each other in loopback fashion: x86-64 -> aarch64 -> x86_64 in the same process.
The JIT I devised did a 1-to-many instruction and CPU state mapping, with overhead that made it roughly 2x to 5x slower than natively recompiled code. I later compared this with QEMU's JIT, which seemed to be more in the range of 10x to 50x slower.
Unfortunately this was not done under an open source license, so there is no code release to prove it.. :(
It's exciting to see that multithreading and exception handling are not impossible to support; they're just out of scope of this particular project.
I wonder if the next step is to then use heuristics to prune the possibility space and reduce the size of the binary (thus breaking the guarantees of the translation, but making portability of the binary practical).
Nope, this translator is much slower than Box64 or FEX. It's just worse unless you can't use JIT for some reason.
This is slower than a direct jmp (which doesn't use the table), but indirect jumps were already slower in the original program and typically don't occur in performance-critical loops.
The main use-case in performance-critical loops is generally something like a core interpreter loop, where you're dispatching on an opcode.
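To make the interpreter case concrete, here is a minimal sketch of a dispatch loop whose hot path is an indirect call through a handler table. The toy stack machine, its opcodes, and the names `Vm`, `handlers` and `run` are all made up for illustration; none of it comes from the paper or from any real interpreter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy stack machine (illustration only). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT, OP_COUNT };

typedef struct {
    const uint8_t *pc;   /* next bytecode byte */
    int32_t stack[16];
    size_t sp;           /* number of values on the stack */
    int running;
} Vm;

typedef void (*Handler)(Vm *);

/* OP_PUSH is followed by one signed immediate byte. */
static void op_push(Vm *vm) { vm->stack[vm->sp++] = (int8_t)*vm->pc++; }
static void op_add(Vm *vm)  { vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; }
static void op_mul(Vm *vm)  { vm->sp--; vm->stack[vm->sp - 1] *= vm->stack[vm->sp]; }
static void op_halt(Vm *vm) { vm->running = 0; }

static const Handler handlers[OP_COUNT] = { op_push, op_add, op_mul, op_halt };

/* Runs until OP_HALT and returns the top of the stack.
   No bounds or opcode checks: this is a sketch, not a hardened VM. */
int32_t run(const uint8_t *code) {
    Vm vm = { .pc = code, .sp = 0, .running = 1 };
    while (vm.running) {
        uint8_t op = *vm.pc++;
        handlers[op](&vm);   /* indirect call: the target depends on runtime data */
    }
    return vm.stack[vm.sp - 1];
}
```

Running it on the bytecode `PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT` goes through the handler table once per opcode. In a statically translated binary, every one of those indirect transfers would presumably pay the address-table lookup.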
- ~4.75x runtime increase (significantly slower than box64, faster than QEMU), 7x executed instruction count increase, 50x binary size increase
- emulates x86 abi until it calls out to external
- has to emulate a large part of x86 cpu state like EFLAGS, compute complex movs individually, etc
- only supports single-thread binaries
- no exception handling/unwinding
- doesn't support the full ISA
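As a rough illustration of why emulating EFLAGS costs extra instructions, here is a sketch of what a translator has to compute for a single 32-bit ADD when the host doesn't produce matching flags for free. The helper name `add32_flags` is hypothetical, and only CF, ZF, SF and OF are modelled (PF and AF are left out); the bit positions are the real EFLAGS ones.

```c
#include <stdint.h>

/* Bit positions of the four flags modelled here, as in the x86 EFLAGS register. */
#define FLAG_CF (1u << 0)   /* carry out of bit 31 */
#define FLAG_ZF (1u << 6)   /* result is zero */
#define FLAG_SF (1u << 7)   /* sign bit of the result */
#define FLAG_OF (1u << 11)  /* signed overflow */

/* Hypothetical helper: returns a + b and stores the flags that a
   32-bit x86 ADD would set for these operands into *flags. */
uint32_t add32_flags(uint32_t a, uint32_t b, uint32_t *flags) {
    uint32_t r = a + b;      /* wraps modulo 2^32 */
    uint32_t f = 0;
    if (r < a) f |= FLAG_CF;                 /* unsigned wrap-around */
    if (r == 0) f |= FLAG_ZF;
    if (r & 0x80000000u) f |= FLAG_SF;
    /* Overflow: operands share a sign and the result's sign differs. */
    if (~(a ^ b) & (a ^ r) & 0x80000000u) f |= FLAG_OF;
    *flags = f;
    return r;
}
```

One guest instruction turns into several host operations here. A real translator would presumably try to skip this work when it can show the flags are never read before being overwritten.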
I mostly work on stuff from the 90s. Disassemblers make a lot of assumptions about where code starts and ends, and occasionally a binary blob is not discoverable unless you have some prior knowledge (such as a pointer at a fixed location to an entry point).
I would think after a few passes you could refine the binary into areas that are definitely code.
So any real program with the possibility to crash is pruned?
Before suggesting to use LLMs to completely rewrite this sort of software, there is a reason why compilers need to be certified to operate in safety critical environments. Not everything needs to use LLMs as the solution to a problem.
I would go as far as to say that using an LLM in this context is the wrong solution and is irrelevant to critical systems. Maybe some here see everything as tokens and must solve everything in the form of using LLMs.
Rewriting a toy web app using LLMs from JavaScript to TypeScript is great, but isn't good for safety critical systems.
It's also not aviation or medical. So perhaps it's more common than you imagine.
Why only x86_64? It would make more sense to convert 32-bit programs, like many old games.
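unsigned char buf[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3}; /* x86: mov eax, 42; ret */
return ((int (*)(void))buf)(); /* jumps into data; works only if buf's memory is executable */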
static translation is only possible when you assume no adversarial code AND mostly assume compiler-produced binaries. hand-rolled asm gets hard, and adversarial code is provably unsolvable in the general case. still, pretty cool for cooperative binaries
Edit: I found this in the paper:
> Elevator sidesteps the code-versus-data determination altogether through an application of superset disassembly [6]: we simultaneously interpret every executable byte offset in the original binary as (i) data and (ii) the start of a potential instruction sequence beginning at that offset, and we build the superset control flow graph from every one of the resulting candidate decodes. Every potential target of indirect jumps, callbacks, or other runtime dispatch mechanisms that cannot be statically analyzed therefore has a corresponding landing point in the rewritten binary. These targets are resolved at runtime through a lookup table from original instruction addresses to translated code addresses that we embed in the final binary.
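A minimal sketch of what such a runtime lookup could look like, assuming a dense table with one slot per byte offset of the original code section. The struct layout, the use of 0 as the "no translation" marker, and the names `AddrMap` and `resolve_target` are my assumptions; the paper excerpt above does not specify the table format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one slot per byte offset of the original code section.
   A slot holds the translated address for a decode starting at that offset,
   or 0 if no valid instruction sequence starts there. */
typedef struct {
    uint64_t orig_base;          /* original address of the first code byte */
    size_t size;                 /* number of byte offsets covered */
    const uint64_t *translated;  /* `size` entries */
} AddrMap;

/* Resolves the target of an indirect jump or call at run time.
   Returns 0 when the target is outside the table or has no translation. */
uint64_t resolve_target(const AddrMap *map, uint64_t orig_addr) {
    if (orig_addr < map->orig_base) return 0;
    uint64_t off = orig_addr - map->orig_base;
    if (off >= map->size) return 0;
    return map->translated[off];
}
```

Because every offset has a slot, a jump into what a conventional disassembler would call the middle of an instruction still finds a landing point. A table covering every byte of the code section is also not small.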
executable stacks are still common (incl on windows with some settings), and sometimes they are required (eg for gcc nested functions)
/s /jk
Maybe try an emulator? There's also this project I found: https://github.com/andirsun/Slacky
> Self Modifying and JIT-Compiled Code. Elevator, like all fully static binary rewriters, does not support self modifying or just-in-time-compiled code.
In x86 land, it's hard to find the instruction boundaries statically, because, for historical reasons going back to the 8-bit era, x86 instructions don't have alignment restrictions. This is what makes translation ambiguous.
If you start at the program entry point and start examining reachable instructions, you can find the instruction boundaries. Debuggers and disassemblers do this. Most of the time it works, but you may have to recognize things such as C++ vtables. Debug info helps there. There may be ambiguity. This seems to be about generating all the possible code options to resolve that ambiguity by brute force case analysis.
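The "start at the entry point and follow reachable instructions" approach can be sketched on a toy instruction set. Everything below is hypothetical: the five-opcode variable-length ISA is invented for illustration and is not x86, and `find_instruction_starts` is not taken from any real disassembler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy ISA (NOT x86): variable-length, absolute 8-bit targets. */
enum {
    T_HALT = 0x00,  /* 1 byte, ends the path */
    T_NOP  = 0x01,  /* 1 byte */
    T_LOAD = 0x02,  /* 2 bytes: opcode, immediate */
    T_JMP  = 0x03,  /* 2 bytes: opcode, target; no fall-through */
    T_BR   = 0x04   /* 2 bytes: opcode, target; may also fall through */
};

#define MAX_CODE 256

/* Sets is_start[i] = 1 for every offset reachable from `entry` as an
   instruction start and returns how many were found.
   Assumes len <= MAX_CODE and that is_start has room for len bytes. */
size_t find_instruction_starts(const uint8_t *code, size_t len, size_t entry,
                               uint8_t *is_start) {
    size_t work[MAX_CODE];   /* pending branch targets */
    size_t top = 0, found = 0;

    memset(is_start, 0, len);
    if (entry < len) work[top++] = entry;

    while (top > 0) {
        size_t pc = work[--top];
        /* Follow straight-line code until it ends or joins a known path. */
        while (pc < len && !is_start[pc]) {
            uint8_t op = code[pc];
            if (op > T_BR) break;                 /* unknown opcode */
            size_t n = (op >= T_LOAD) ? 2 : 1;    /* instruction length */
            if (pc + n > len) break;              /* truncated instruction */
            is_start[pc] = 1;
            found++;
            if (op == T_HALT) break;
            if (op == T_JMP || op == T_BR) {
                size_t target = code[pc + 1];
                if (target < len && !is_start[target]) work[top++] = target;
                if (op == T_JMP) break;           /* unconditional: stop here */
            }
            pc += n;
        }
    }
    return found;
}
```

Bytes that are never reached, such as trailing data, stay unmarked. That is exactly the weakness: a target that is only reachable through a computed pointer is missed too, which is the gap the brute-force approach tries to close.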
x86 doesn't have explicit code/data separation, which some architectures do. So they have to try instruction decoding on all data built into the executable. They cull obvious mistranslations. Yet they still have a 50x space expansion, someone mentioned. Most of those will be unreachable mistranslated code.
You can't look at a static executable which uses pointers to functions and say "that data cannot possibly be code", without constraining what those pointers point to. That involves predicting run-time behavior, which may not be possible.
There's a lot of x86 crufty edge-cases to handle to achieve perfect(ish) emulation or translation.
After those machines, at the Pentium Pro, with look-ahead instruction decoding, it became a major loss to store into code. Superscalar x86 CPUs have the hardware to detect and handle stores into code, but it requires bringing the CPU to a clean halt, almost like an exception interrupt, discarding pipelined work that's already been done, and then restarting the pipeline, reloading the instructions ahead. All the performance gains of superscalar hardware are lost for a while.
There are RISC architectures where self-modifying code isn't supported, and code pages must be read-only. Then the CPU doesn't need the machinery for detecting and aborting look ahead on a store into code. MacOS has enforced that rule since the PowerPC era.
If it did, it wouldn't be "fully static" anymore. It's fundamentally contradictory.
* A Case Study on the Effectiveness of LLMs in Verification with Proof Assistants, https://arxiv.org/abs/2508.18587v1
* CoqPilot, a plugin for LLM-based generation of proofs, https://dl.acm.org/doi/10.1145/3691620.3695357
Certainly it's not on the AI industry's list of priorities. Perhaps, however, it's not supposed to be. I use AI tools for the use case I mentioned. The source code, build system, binary artifacts and hashes are still regulated in the way I described. The fact that the AI industry was involved in that chain simply isn't relevant.
Other use cases involving real time agents and whatnot are another story. I'm not dealing with that problem. I suspect the AI industry doesn't really care about such attestation at this point because everyone is still in the frothy world of "new!" and the bureaucrats simply haven't caught up yet, and the adopters are taking advantage while they can. That pattern has recurred throughout the history of communication and computers.
I don't really object to that. There will be plenty of time for security theater after whatever limits are eventually found and exploited, and in the meantime there is free oxygen available.