RISC-V J extension – Instructions for JITs(github.com) |
Of course in modern architectures being able to do something in one instruction is only tenuously related to being able to do something quickly, but it was a super handy instruction back in the day.
Equally on such a system the only thing left for FENCE.I to do is to flush any (potentially now bogus) subsequent instructions that are in the execution pipe that might have been prefetched before the writes occurred. In such a system FENCE.I and IMPORT.I are identical.
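For anyone who hasn't written a JIT: a minimal sketch of the software side of this, in C. It assumes GCC or Clang, whose `__builtin___clear_cache` builtin asks the target to make freshly written code visible to instruction fetch (it compiles to nothing on x86, where hardware already keeps I and D coherent). `publish_code` is a hypothetical helper name, and the destination is assumed to live in a mapping the JIT will later execute from; none of this comes from the J extension draft itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical JIT helper: copy freshly generated machine code into place,
 * then ask the compiler/runtime to do whatever the target needs before the
 * bytes may be fetched as instructions. */
void publish_code(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);  /* ordinary data writes */
    /* GCC/Clang builtin; a no-op on targets with coherent I/D caches. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

On a machine with full hardware coherency the second call has nothing to do beyond discarding stale prefetched instructions, which is the point being made above.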
Hopefully the people writing this spec are listening ... please make sure your spec understands high-end systems like this and doesn't add stuff that requires special cases in systems that do ubiquitous coherency right.
Particular RISC-V Platform specs may end up requiring I/D coherency, like Arm is recommending in SBSA Level 6, but that's left for later, if ever.
To be very very clear: FJCVTZS does not do anything amazing, clever, or special. The problem it solves is very simple: the behaviour of double->int conversion in JS is the default x86 behaviour. Getting that behaviour on any non-x86 platform is expensive. So a more accurate name would be FXCVTZS. The implementation of FJCVTZS in a CPU is also not expensive, it simply requires passing a specific rounding mode to the FPU for the integer conversion (overriding the default/current global mode), and matching the x86 OOB result.
(Also I really wish people would stop posting to GitHub repos unless the repos have the actual readable spec available or linked, rather than the unbuilt markup version. It just makes reading them annoying.)
It seems like the objective of this is to implement different access privileges... but why do you need specialized instructions for this? This is typically done by the OS and memory protection. The pointer masking extension would be to have multiple levels of privilege within a single process? I'm assuming that this is to protect the JIT from a JITted program? Except it's not completely safe, because there might still be bugs in the JIT that could allow messing with the pointer tags. Struggling to think of a real use case.
Also to note that all hardware vendors are adopting hardware memory tagging as the only way to fix C.
Intel messed up with MPX, but I definitely see them coming up with an alternative, as I bet they won't like to be seen as the only vendor left without such capabilities.
Already being successfully used for decades in Solaris SPARC, iOS/macOS and Android are increasingly pushing for it on ARM CPUs, Pluton on Azure Sphere OS,...
Seems to me this will have an execution overhead though, and that the best way to improve security would be to finally move beyond C. Most modern languages make buffer overflows impossible.
The basic idea is you often want finer than page-level granularity on memory access rights. An example ARM give in the documentation covering the ARM MTE is an allocator. With memory tagging you can make it so unallocated memory in the allocator is not accessible.
Essentially every piece of memory gets a tag, and you can only access a piece of memory through a pointer that has the matching tag. To illustrate imagine an allocator (which is the example ARM have in the documentation for the ARM MTE)
Your allocator has a bunch of memory, and has all of it set to be tagless (uncolored in ARM terminology IIRC):
|bbbbbbbbbb|
When your allocator allocates a byte it does the following:
1. Finds a free block
2. Chooses a tag (randomly if it wants)
3. Sets the tag on that memory to the tag selected in (2)
4. Returns a pointer to that memory tagged with the tag from (2)
So we get something like:
|1bbbbbbbbb|
p = (1,0) // pointer with a tag of 1 and the address 0
Now any access to the memory at address 0 must be via a pointer with the tag 1, and any memory accessed via that pointer must be tagged with 1.
So imagine you have a bunch of allocations
|13251bbbbb|
You can see we've re-used a tag, because there is a finite amount of space for tags in a pointer, so while our original allocation was a 1 byte allocation at 0, we can do p[4] and the access will work. However, if we're choosing the tag randomly an attacker is in theory unlikely to be able to luck out and get the correct tag, so your process crashes (it's super important for these mechanisms that any failure results in an unstoppable crash, e.g. no signal handlers or anything). Another thing your allocator does is revert memory to being untagged (or I guess tagged distinctly) on free, so a use after free also cannot work.
In reality the tagging is not per byte because that would be insane: MTE has a significant increase in the physical ram requirements for a system. If you have an N-bit tag, that means you need to have N extra bits in the physical ram for every granule. I don't know what sort of granule sizes people are looking at but the overhead in physical ram requirements is literally (granule size in bits + bits for tag)/(granule size in bits) so you can see how significant this is.
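The scheme above can be sketched as a toy software model in C. This is only an illustration of the checking rule, not how MTE is implemented: the names (`toy_ptr`, `toy_alloc`, `toy_free`, `toy_access_ok`) are made up, tag 0 is assumed to stand in for the untagged 'b' cells, the caller picks the tag so the example is deterministic (a real allocator would pick it randomly), and a failed check returns false where real hardware would fault.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_GRANULES 10  /* matches the 10-cell diagrams above */
#define TOY_UNTAGGED 0   /* assumption: tag 0 plays the role of 'b' */

/* A pointer is a (tag, address) pair, like p = (1,0) above. */
typedef struct {
    uint8_t tag;
    size_t  addr;
} toy_ptr;

/* One tag per memory cell; everything starts untagged. */
static uint8_t toy_tags[TOY_GRANULES];

/* Tag the cell at addr and hand back a pointer carrying the same tag. */
toy_ptr toy_alloc(size_t addr, uint8_t tag) {
    toy_tags[addr] = tag;
    toy_ptr p = { tag, addr };
    return p;
}

/* Freeing reverts the cell to untagged, so stale pointers stop matching. */
void toy_free(toy_ptr p) {
    toy_tags[p.addr] = TOY_UNTAGGED;
}

/* An access is allowed only in bounds and when pointer tag == memory tag.
 * In this toy model untagged pointers never match anything. */
bool toy_access_ok(toy_ptr p, size_t offset) {
    size_t a = p.addr + offset;
    return a < TOY_GRANULES
        && p.tag != TOY_UNTAGGED
        && toy_tags[a] == p.tag;
}
```

With the |13251bbbbb| layout, `toy_access_ok(p, 1)` fails for p = (1,0), `toy_access_ok(p, 4)` succeeds because the tag was re-used, and after `toy_free(p)` even `toy_access_ok(p, 0)` fails.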
Unlike PAC, my understanding is there is no cryptographic logic linking the tag to the pointer, so pointer arithmetic continues to work without overhead, whereas in a PAC model p += 1, say, would be: temp = AUTH(p), temp = temp + 1, p = SIGN(temp).
The purpose of PAC is not to protect the memory, but rather the pointer itself. For example imagine you have a C++ object, the basic layout is essentially:
struct {
  void* vtable;
  // data fields
};
For those unfamiliar, a vtable is essentially just a list of function pointers to support polymorphism. In this case the vtable pointer is tagged with the appropriate tag for wherever the vtable is. Because the vtable itself is stored in tagged memory it can't be modified by the attacker (in reality tables are all in read only memory, but pretend they're not for this example). But if the attacker can get some random, correctly tagged pointer what they can do is build their own vtable in that memory, and then simply overwrite the vtable pointer with their correctly tagged pointer for the malicious vtable. Of course you can just have the memory holding the object itself also be tagged, so they need the correct pointer tagging for that :D
In the PAC model the pointer is signed by a secret key (it's literally inaccessible to the process) and a nonce (on Mac + iOS this nonce includes the address of the vtable pointer itself). For an attacker to create a valid pointer they need to be able to generate the correct signature over the bits in the pointer and the nonce. Because different nonces are used for pointers in different uses, they can't just get (for example) one object to overwrite another. If the nonce includes the address of the pointer they can't even just copy a validly signed pointer from another location in memory.
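A toy model of the sign/authenticate flow, in C, may help. Big caveat: the "signature" here is a plain XOR fold chosen so the example is easy to check by hand. It is NOT cryptographic and is trivially forgeable; real PAC uses a keyed cryptographic function in hardware with a key the process cannot read. The 48-bit address width and all the names (`toy_sign`, `toy_auth`, `toy_ptr_add`) are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_ADDR_BITS 48  /* assumption: low 48 bits hold the address */
#define TOY_ADDR_MASK ((UINT64_C(1) << TOY_ADDR_BITS) - 1)

/* Stand-in for the keyed MAC: XOR-fold addr, nonce and key down to 16 bits.
 * NOT secure; only the data flow matches the description above. */
static uint64_t toy_mac(uint64_t addr, uint64_t nonce, uint64_t key) {
    uint64_t x = addr ^ nonce ^ key;
    x ^= x >> 32;
    x ^= x >> 16;
    return x & 0xFFFF;
}

/* SIGN: put the signature in the otherwise unused top 16 bits. */
uint64_t toy_sign(uint64_t addr, uint64_t nonce, uint64_t key) {
    addr &= TOY_ADDR_MASK;
    return addr | (toy_mac(addr, nonce, key) << TOY_ADDR_BITS);
}

/* AUTH: recompute the signature; on a match, hand back the raw address. */
bool toy_auth(uint64_t signed_ptr, uint64_t nonce, uint64_t key,
              uint64_t *addr_out) {
    uint64_t addr = signed_ptr & TOY_ADDR_MASK;
    if (toy_sign(addr, nonce, key) != signed_ptr)
        return false;  /* real hardware would poison the pointer or fault */
    *addr_out = addr;
    return true;
}

/* p += delta in a signed-pointer world: AUTH, add, SIGN again. */
bool toy_ptr_add(uint64_t signed_ptr, uint64_t delta, uint64_t nonce,
                 uint64_t key, uint64_t *signed_out) {
    uint64_t addr;
    if (!toy_auth(signed_ptr, nonce, key, &addr))
        return false;
    *signed_out = toy_sign(addr + delta, nonce, key);
    return true;
}
```

If the nonce is the address where the pointer is stored, copying a signed pointer to a different location makes `toy_auth` fail there, which is the "can't just copy a validly signed pointer" property described above.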
I really do like the PAC model a lot, but to me the MTE mechanism seems to be a much stronger protection mechanism, albeit a very expensive one (PAC doesn't require additional ram for the signed pointers).
RISC-V basically eliminates a lot of microarchitectural state (flags), whereas AArch64 updates that state conditionally. We will find out which approach is superior soon.
FJCVTZS is an example of pragmatism, the JavaScript spec says float to int should be done the way that x86 does it, the original ARM FCVTZS (no J) didn't do it the same way, but JavaScript is so important you have to add a special case.
I hope I'm not mischaracterising the RISC-V side, but I seem to recall their argument against things like FJCVTZS was that there should be some standard set of instructions that compilers should emit for that special case, and the instruction decoder on high end CPUs should be magic enough to detect the sequence and do optimal things (fused instructions?). Which kinda felt like "we must keep the instruction set as simple as possible, even if it makes the implementation of high performance CPUs complex". See also the "compressed instructions" stuff, which feels again like passing the buck for complexity onto the CPU implementation side (unless it's just a Thumb like 16 bit wide instruction set thing given a misleading name).
The compressed instructions are quite lightweight. It's generally an assembly level thing, and the decoder on the cpu side is apparently ~400 gates.
The compressed instructions are indeed a 16 bit wide thing, but fixing some of the flaws in Thumb. Generally they have more implicit operands or operands range over a subset of registers to fit in 16 bits.
But the hat trick is these two dovetail into each other, such that a sequence of compressed instructions can decompress into a fuse-able pair/tuple, which then decodes into a single internal micro op. This creates a way to handle common idioms and special cases without introducing an ever growing number of instructions. Or at least that's the basic claim by the RISC-V folks. I think they've done enough homework on this to not be trivially wrong, so it'll be interesting to see how things go.
I don't know what you think RISC-V "compressed instruction" means. It's precisely equivalent to ARM Thumb2 -- there are 16 bit opcodes and 32 bit opcodes and you can tell which you have by looking at 2 bits (RISC-V) or 3 bits (Thumb2) in the first 16 bits of the instruction.
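The RISC-V half of that is small enough to write down. A sketch in C, with a made-up function name, that only distinguishes 16-bit from 32-bit encodings (the base ISA also reserves patterns for instructions longer than 32 bits, which this ignores):

```c
#include <stdint.h>

/* Length in bytes of a RISC-V instruction, judged from its first 16-bit
 * parcel: low two bits 0b11 means a 32-bit encoding, anything else is a
 * 16-bit compressed instruction. Longer (48-bit and up) encodings are
 * deliberately not handled in this sketch. */
unsigned rv_insn_length(uint16_t first_parcel) {
    return (first_parcel & 0x3u) == 0x3u ? 4u : 2u;
}
```

For example, `c.nop` (0x0001) gives 2, and the low half of the ordinary `nop` encoding 0x00000013 gives 4.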
I don't believe there is any practical "magical" sequence of instructions that could be easily recognised to implement Javascript conversion from float to int. If that is in fact as important as ARM apparently think it is (I have my doubts) then an equivalent of FJCVTZS should be added to RISC-V as an extension.
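To give a feel for what has to happen without a dedicated instruction, here is a sketch in C of ECMAScript's ToInt32 as I understand it: NaN and infinities become 0, the value is truncated toward zero, and the result wraps modulo 2^32. The function name is made up, the code assumes IEEE-754 binary64 doubles, and it is a plain bit-twiddling version rather than what any particular JS engine emits.

```c
#include <stdint.h>
#include <string.h>

/* Software sketch of ECMAScript ToInt32 for an IEEE-754 binary64 input. */
int32_t js_to_int32(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* reinterpret without UB */

    unsigned biased = (unsigned)((bits >> 52) & 0x7FF);
    if (biased == 0x7FF)
        return 0;                             /* NaN, +Inf, -Inf */

    /* |d| = mant * 2^exp with mant a 53-bit integer (implicit leading 1). */
    int exp = (int)biased - 1075;
    uint64_t mant = (bits & ((UINT64_C(1) << 52) - 1)) | (UINT64_C(1) << 52);

    uint32_t mag;
    if (exp <= -53)
        mag = 0;                              /* |d| < 1, incl. zero/subnormals */
    else if (exp < 0)
        mag = (uint32_t)(mant >> -exp);       /* drop the fraction, keep low 32 bits */
    else if (exp < 32)
        mag = (uint32_t)(mant << exp);        /* only the low 32 bits matter */
    else
        mag = 0;                              /* a multiple of 2^32 */

    if (bits >> 63)
        mag = 0u - mag;                       /* negate modulo 2^32 */

    /* Conversion of values >= 2^31 is implementation-defined before C23;
     * mainstream compilers wrap to two's complement. */
    return (int32_t)mag;
}
```

So 3.7 gives 3, -3.7 gives -3, 4294967297.0 gives 1 and 2147483648.0 wraps to -2147483648. That is a fair number of branches and shifts compared with a single conversion instruction.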
As for "making the implementation of high performance CPUs complex" … high end CPUs are unavoidably complex. A little bit more is not a big deal. On the other hand, adding complexity to low end CPUs can easily be a complete deal-killer. Splitting an instruction into µops might be a little simpler than combining instructions into macro-ops, but it's not as simple as not having to do it.
Ironically, the people who criticise RISC-V for talking about macro-op fusion seem to be ignorant of the fact that no currently shipping RISC-V SoC does macro-op fusion [1], while every current higher end ARM and X86 does do macro-op fusion of compare (and maybe other ALU) instructions with a following conditional branch instruction.
[1] SiFive U74 can tie together a forward conditional branch over a single integer ALU instruction with that following instruction. They pass down the two execution pipes in parallel (occupying both i.e. they are still two instructions, not a macro-op). The ALU instruction executes regardless, but the conditional branch controls whether the result is written back. i.e. it effectively converts a branch into predication
Detecting a long fixed sequence of instructions and "compressing" them into one internal operation seems like it would require a lot of fetch bandwidth and/or a really wide decoder. x86 has had macro-fusion since Core Solo/Duo.
There is nothing magic about it.
A more correct name for FJCVTZS would be FXCVTZS. What FJCVTZS does is override the default FPU rounding and signaling results for double to integer conversion to match the x86 behaviour. There is no special logic needed in the FPU, all that happens is instead of the instruction passing the current thread FPU rounding and clamping flags, it passes the flags that exactly match x86 behaviour.
That's it.
Because the JS label is inaccurate everyone believes it to be useless outside of js, when in reality it's useful to anything that needs x86 behavior for double->int conversion, so any x86 emulators on arm (Qemu, presumably the translation runtimes, etc).
God I hate that they named it that.
A good comparison is R7RS with scheme. The vast majority of it is optional RFCs that exist for the sake of consistency and aren't implemented by most schemes. The "mandatory" parts are specified via R7RS-small and work is being done on R7RS-large, though even that won't contain every RFC.
I could see us ending up with an equivalent for RISC-V where a common group of extensions get grouped together as a standard (likely including stuff like virtualization support but excluding vector operations).
Not really. CPUs do out-of-order because cache hits are unpredictable and it is crucial for single-threaded performance to make progress on dependent operations as soon as a loaded value is available.
There may be other, lower order, factors, but variable memory latency is the real reason.
It was already known since the early days how bad C was versus the competition.
UNIX made it famous, UNIX won the server room wars, UNIX will keep it going.
They claim 2%, but only in JS code. I'd guess static analysis of outputted v8/JSC/SM JIT code from the top 100 websites would give a very accurate estimation of the savings. One of the most fundamental performance boosters is using 31-bit ints instead of doubles, but every single time the user needs to access a number for output, it must be converted to a double to keep the JS spec contract.
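For readers who haven't seen the 31-bit int trick: a rough sketch in C of one way to do it. The encoding here (value shifted left by one, low bit 0 marking a small integer) is an assumption for illustration; real engines differ in the details, and all names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define SMI_MIN (-1073741824)  /* -2^30 */
#define SMI_MAX 1073741823     /*  2^30 - 1 */

/* Does v fit in the 31-bit signed payload? */
bool smi_fits(int32_t v) {
    return v >= SMI_MIN && v <= SMI_MAX;
}

/* Box: shift left by one so the low bit is 0 (the assumed "small int" mark).
 * Caller is expected to check smi_fits first. */
uint32_t smi_box(int32_t v) {
    return (uint32_t)v << 1;
}

/* Low bit 0 means the word is a small int, not a pointer to a boxed double. */
bool is_smi(uint32_t word) {
    return (word & 1u) == 0;
}

/* Unbox: the word is even, so dividing by 2 is exact for either sign. */
int32_t smi_unbox(uint32_t word) {
    return (int32_t)word / 2;
}

/* Leaving the fast path means converting back to the double JS promises. */
double smi_to_double(uint32_t word) {
    return (double)smi_unbox(word);
}
```

Anything that fails `smi_fits` has to fall back to a double, and going from a double back to an integer is where conversions like the one FJCVTZS performs come in.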
All that said, I think only Apple's last 4-6 chips and ARM's most recent generation of chips actually implement the instruction and people have been fine without it. I'd guess we'll not be seeing this in RISC-V until much lower-hanging fruits have been picked.
Does ARM allow any freedom in tag size, or is it strictly 4 bits?
I realize I may not have been clear for people unfamiliar with MTE*: tagging is device level, so you can't (for example) put the tags in a separate mapping and just increase your usage of existing memory by 3% (obviously a software implementation could do that, but the perf would probably be suboptimal :D ). You literally need X% more dram cells.
* Not saying @my123 doesn't understand, just I can't edit my original comment and I figure contextually this is reasonable :D
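The 3% figure falls out of simple arithmetic. Arm's MTE uses 4-bit tags on 16-byte granules, so the extra storage is 4 bits per 128 bits of data. A throwaway C helper (the name is made up) that computes the extra storage as a percentage for any granule size and tag width:

```c
/* Extra physical storage needed for tags, as a percentage of data storage:
 * tag_bits per granule of granule_bytes * 8 data bits. */
double tag_overhead_percent(unsigned granule_bytes, unsigned tag_bits) {
    return 100.0 * tag_bits / (granule_bytes * 8.0);
}
```

`tag_overhead_percent(16, 4)` is 3.125, while a hypothetical per-byte scheme with an 8-bit tag, `tag_overhead_percent(1, 8)`, would double the memory.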
> C Language. Dialect ISO C. ISO C source programs invoking the services of this Product Standard must be supported by the registered product.
-- http://get.posixcertified.ieee.org/docs/si-2016.html
I should also note that many attempts to add safer types to C have been tried, WG14 just doesn't care about them.
https://docs.microsoft.com/en-us/cpp/build/reference/kernel-...
> Creates a binary that can be executed in the Windows kernel. The code in the current project gets compiled and linked by using a simplified set of C++ language features that are specific to code that runs in kernel mode.
And then there is WIL, https://github.com/microsoft/wil
https://community.osr.com/discussion/291326/the-new-wil-libr...
> First off, let me point out that this library is used to implement large parts of the OS. There are hundreds of developers here who use it. So unlike, uh, some other things that get tossed onto github, this project is not likely to wither and die tomorrow.
> There are, however, only a handful of kernel developers working on the library, so the kernel support has been coming along much slower. I'd like to expand the existing kernel features in depth ....
However these things often get turned into stronger (or different) arguments as they pass from mouth to ear repeatedly.
Sometimes they change completely, as in "the plural of anecdote is data"
The interesting thing is, I turned on compiler optimisations. When I examined the assembled output (even though my knowledge of assembly is poor), I discovered that it had made the optimisations that you would find in a more complex C implementation. The compiler obviously thought to itself "I see what you're doing here", and put in a better version.
So the moral of the story is: your compiler is likely to be able to figure out a lot.
I love that line!
the fact LLVM allows javascript to be transpiled to C doesn't mean Linux kernel has been rewritten in Javascript
this doesn't mean the ntoskrnl.exe is written in C++
the fact nvidia's linux loadable kernel blob is written in C++ doesn't suddenly mean linux is written in C++
"grasping at straws" would seem to sum up your position
A kernel without drivers, only produces heat.
> "grasping at straws" would seem to sum up your position
Fits exactly the position of someone that desperately wants to assert ntoskrnl.exe is written just like when NT 3.51 got released into the world.
"Kernel proper - This is mostly written in C. Things like the memory manager, object manager, etc. are mostly written in C. The boot loaders are written in ASM, but set up a C environment rather quickly.
Drivers - that said, a lot of newer kernel mode drivers are actually written in C++ (however, its style is more akin to "C with classes". Lower level code has been much slower to adopting anything past C++98)
User land - Mostly C++ with varying levels of quality and version compliance. If it's a pre-Windows 8.0 component, it was written against mostly C++98. More recent features are C++14 and better."
-- https://www.reddit.com/r/cpp/comments/4oruo1/windows_10_code...
Bye, have fun with C.
thank you
only took 17 hours to get there, but we finally got there