RISC-V: The New Architecture on the Block(klarasystems.com) |
Many of the higher performance RISC-V designs do, in fact, do speculation. RISC-V BOOM[0], by Berkeley, is vulnerable to Spectre[1][2]. One of the attempts to create an extension to the RISC-V ISA that has integrated security features (CHERI, [3]) itself was shown to be vulnerable to Spectre-like attacks[4].
The fact that most RISC-V chips were not vulnerable to Spectre is simply because they hadn't implemented a particular kind of performance optimization, not because there was anything intrinsic to the ISA that prevented them from being so.
[1]: https://github.com/abejgonzalez/boom-attacks
[2]: https://boom-core.org/docs/replicating_mitigating_spectre_ca...
[3]: https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/cheri...
[4]: https://kth.diva-portal.org/smash/get/diva2:1538245/FULLTEXT...
It's not actually very hard to avoid these kinds of vulnerabilities. All that is needed is to not permanently update state until the instruction is no longer speculative.
For example:
- don't update the branch prediction tables until the branch is proven to execute
- if loading a cache line causes another cache line to be evicted, keep both until it is known whether the load is supposed to execute
This requires provisioning a few more of these kinds of resources than you might previously have had, which costs a little silicon area, but it doesn't cost speed. Sometimes particularly demanding code might cause you to run out of these speculation resources and then you have to stall until an entry is freed up. This can already happen with things such as store buffers. If it never happens then you've probably over-provisioned :-)
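As a rough illustration of the first bullet, here is a toy software model (not real hardware, and not any particular core's design) of a branch predictor that parks speculative outcomes in a small buffer and only touches its counters at commit. The table size, slot count, and all names (`Predictor`, `predictor_record`, and so on) are made up for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 16   /* hypothetical table size */
#define PENDING_MAX 4    /* hypothetical number of speculation slots */

typedef struct {
    uint32_t index;  /* which counter the branch maps to */
    bool taken;      /* outcome observed while still speculative */
} PendingUpdate;

typedef struct {
    uint8_t counters[BHT_ENTRIES];      /* 2-bit saturating counters, 0..3 */
    PendingUpdate pending[PENDING_MAX]; /* updates not yet known to be on the real path */
    int pending_count;
} Predictor;

void predictor_init(Predictor *p) {
    for (int i = 0; i < BHT_ENTRIES; i++) {
        p->counters[i] = 1;             /* start weakly not-taken */
    }
    p->pending_count = 0;
}

/* Record a speculative outcome. Returns false when every slot is in use,
   which models the "stall until an entry is freed up" case. */
bool predictor_record(Predictor *p, uint32_t pc, bool taken) {
    if (p->pending_count == PENDING_MAX) {
        return false;
    }
    p->pending[p->pending_count].index = (pc >> 2) % BHT_ENTRIES;
    p->pending[p->pending_count].taken = taken;
    p->pending_count++;
    return true;
}

/* The branches were proven to execute: apply the buffered updates in order. */
void predictor_commit(Predictor *p) {
    for (int i = 0; i < p->pending_count; i++) {
        uint8_t *c = &p->counters[p->pending[i].index];
        if (p->pending[i].taken) {
            if (*c < 3) (*c)++;
        } else {
            if (*c > 0) (*c)--;
        }
    }
    p->pending_count = 0;
}

/* Misspeculation: drop the buffered updates, leaving the table untouched. */
void predictor_squash(Predictor *p) {
    p->pending_count = 0;
}
```

The point of the model is only that a squash leaves `counters` exactly as it was, so nothing about the wrong path is remembered, and that running out of slots shows up as a stall rather than a leak.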
> - if loading a cache line causes another cache line to be evicted, keep both until it is known whether the load is supposed to execute
This sounds interesting. How do these countermeasures actually prevent Spectre V1? And if they do, why didn't Intel implement them? Seems like they are fully microarchitectural, and therefore opaque to the software world.
Some of the design decisions, and their expressed rationale, are considered unpersuasive by many involved with other architectures. For example, a status register, cited as interfering with optimal out-of-order execution, turns out not to be a problem in actual chips (where they rename it like other registers), so was omitted from the RISC-V design on what amounts to superstition. Some instruction sequences that would need to be "fused" to match performance of common chips involve many more instructions than are fused in any extant design, so it is unclear that such fusion would be practically achievable.
ISAs without condition codes have been around for a long time, and very technically successful. MIPS and DEC Alpha, for example. (both killed by clueless management, not any technical issue)
The vast vast majority of condition code updates are either never used at all or are used by the very next instruction. In either case, there is little point in reifying them and no point in saving them.
Generating a condition a few instructions before it is used happens from time to time, at least on ISAs where only some instruction types update the condition codes, or where a flag in the instruction indicates whether to update them. An ISA without condition codes but with plenty of registers can do the same thing using SLT/SLTU (Set if Less Than [Unsigned]) to generate a 0 or 1 in a normal register. Or a simple XOR or SUB for equality tests.
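To make that concrete, here is a small C sketch of the same idea: the carry out of an addition is kept as a 0 or 1 in an ordinary variable (what SLTU would produce) and consumed a few operations later. The `U128` type and the function names are made up for the example, and the instruction names in the comments are only what I would expect a RISC-V compiler to emit.

```c
#include <stdint.h>

typedef struct {
    uint64_t lo;
    uint64_t hi;
} U128;  /* hypothetical two-word integer */

/* 128-bit add with no flags register: the carry is just a value. */
U128 add128(U128 a, U128 b) {
    U128 r;
    r.lo = a.lo + b.lo;             /* add */
    uint64_t carry = r.lo < a.lo;   /* sltu: 1 if the low word wrapped, else 0 */
    r.hi = a.hi + b.hi + carry;     /* add, add */
    return r;
}

/* Equality test without flags: XOR is zero exactly when a == b. */
uint64_t is_equal(uint64_t a, uint64_t b) {
    return (a ^ b) == 0;            /* xor, then compare the result against zero */
}
```

Because `carry` lives in a normal register, it can be generated well before it is used and nothing needs to be saved or restored around it.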
Historically, use of condition codes is because your instructions aren't big enough to contain two source operands, a test, and a reasonable branch offset. Now it's because you're descended from such an ISA.
Similarly, many early ISAs did conditional skip instead of conditional branch because their instructions weren't big enough to test a condition and also hold a useful branch offset. Some of them could integrate a compare with the skip, but some of them needed three instructions: compare -> CCs; skip based on CCs; jump. Not high performance.
Compare and branch, all in one instruction, is best most of the time if you have the opcode space for it.
https://ocw.mit.edu/courses/electrical-engineering-and-compu...
They have been Turing complete from the first, so the differences are limited to speed, power consumption, and incidentals.
1) taking control over transient instruction execution
2) the controlled transient instructions access data that is legal for the victim to read (but illegal for us)
3) controlled transient instructions exfiltrate this data through a side channel between the microarch and the arch
For Spectre V1: Step 1) is performed by the Branch-Predictor, step 2) depends on the gadget targeted within the victim code, and step 3) is completed using a FLUSH+RELOAD or an EVICT+RELOAD triggered by a transient load.
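For reference, the kind of victim "gadget" step 2) refers to usually has the shape sketched below: a bounds check followed by two dependent loads. This is only the victim-side pattern, with made-up names and sizes (`array1`, `array2`, `PROBE_STRIDE`); it is architecturally correct code and contains no attack.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY1_SIZE 16      /* hypothetical size of the bounds-checked array */
#define PROBE_STRIDE 4096   /* hypothetical spacing so each value maps to its own cache line */

uint8_t array1[ARRAY1_SIZE];
uint8_t array2[256 * PROBE_STRIDE];

/* Architecturally, out-of-range x never reaches the loads. If the branch is
   mispredicted, a speculating core may run both loads transiently, and the
   second load's address depends on the byte read by the first. */
uint8_t victim_function(size_t x) {
    if (x < ARRAY1_SIZE) {                        /* step 1: the predicted branch */
        uint8_t value = array1[x];                /* step 2: transient access */
        return array2[value * PROBE_STRIDE];      /* step 3: cache footprint depends on value */
    }
    return 0;
}
```

The cache line of `array2` touched by the second load is what a FLUSH+RELOAD or EVICT+RELOAD probe would later observe.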
If one of these 3 steps is not met then the attack is impossible. The brucehoult proposal (obviously not the first to suggest this) is to eliminate step 3): if no transient execution side effect/microarchitectural state is made observable, then there is no way to exfiltrate data. All Spectre/Meltdown attacks are therefore made unfeasible.
The problem is that brucehoult's proposal does not guarantee that all side channels are infeasible; it only guarantees that side channels based on branch prediction or caching are no longer possible.
Furthermore, microarchitectural optimizations are made to have an observable effect on the execution time. Therefore, it's likely that other timing side-channels will be exposed/discovered/used.
Setting a 0 or 1 in SLT was another design error. People designing GPUs demonstrate that they know the better design sets a 0 or ~0 (all ones).
Huge instructions have been regretted enough to motivate abbreviated versions. Even in RISC-V.
And, as has already been noted in this forum, lack of a reliably available popcount instruction has been subsequently corrected, at great expense, practically everywhere.
All of which really only means I'm ready for Risc-6. With some care, it should be able to re-use much of the ecosystem work from RISC-V.
Going from 0/1 to 0/~0 (or conversely) just takes a NEG instruction. All in all, it's a trivial difference. And it's hard to say what's more convenient in actual code.
It was an error, though a rather minor one, to follow the C language so closely. I can and have pointed out other minor mistakes in RISC-V in the past -- none of them serious enough to abandon it and start over.
I'll quote myself from there, below.
32 bits is not such a huge instruction. ARM decided it's good enough for their new(ish) 64 bit ISA, and it's about the average size of x86_64 instructions.
Original RISC-V (v1.0) has only and exactly the instructions needed to implement C. That's enough for many or most applications, and will be available as a support option forever. The upcoming RVA22 specification for Applications Processors, which will be ratified before the end of the year, includes an SVE-like vector extension and also Bit Manipulation extensions (along with many others). The Zbb (Basic bit-manipulation) extension includes cpop along with clz and ctz and rotate. There is also andn, orn, xnor, max, maxu, min, minu, sext.b, sext.h, zext.h, and rev8 (reverse bytes in a register). Plus a unique instruction orc.b which replaces any non-zero byte in the source operand with all ones. There is also scalar crypto and cache manipulation (prefetch, flush etc).
Perhaps RVA22 is your hypothetical Risc-6.
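Since orc.b is the unusual one in that list, here is a plain C reference model of what it computes on a 64-bit register, going only by the description above (any non-zero byte becomes all ones). The function names are made up, and `has_zero_byte` is just one plausible use, along the lines of string-scanning code.

```c
#include <stdint.h>

/* Reference model of orc.b on a 64-bit value:
   each non-zero byte becomes 0xFF, each zero byte stays 0x00. */
uint64_t orc_b(uint64_t x) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = (x >> (8 * i)) & 0xFF;
        if (byte != 0) {
            result |= (uint64_t)0xFF << (8 * i);
        }
    }
    return result;
}

/* Possible use: check 8 bytes of string data for a terminating zero at once.
   If every byte is non-zero, orc_b gives all ones. */
int has_zero_byte(uint64_t word) {
    return orc_b(word) != UINT64_MAX;
}
```

In hardware this is a single instruction; the loop is only there to spell out the semantics.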
-----
There are five reasons you might use SLT / SLTU, in (I think) descending order of how common they are, and the implications had -1 been used instead of 1:
1) to generate a zero/non-zero value. No difference.
2) to generate a mask. Using 0 and -1 is better, saving a NEG or a subtract 1, depending on whether you reverse the condition or not.
3) to generate a value that can be AND / OR / XOR etc with other such values. No difference.
4) to assign to a canonical C/C++ true/false, or mix with them using AND / OR / XOR. Worse -- have to do an ANDI #1 before using the final result.
5) to generate a canonical C true/false and add or subtract it from something. No difference. Just flip add to subtract or vice versa.
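Cases 2) and 4) can be sketched in C as follows. This is a minimal illustration with made-up function names; `slt01` stands in for RISC-V's actual 0/1 SLT, and the negation is the extra instruction that a 0/-1 SLT would have saved.

```c
#include <stdint.h>

/* What RISC-V SLT gives today: 0 or 1. */
uint32_t slt01(int32_t a, int32_t b) {
    return a < b;
}

/* Case 2: turn 0/1 into a mask with one NEG (0 -> 0, 1 -> 0xFFFFFFFF). */
uint32_t mask_from_01(uint32_t flag) {
    return 0u - flag;
}

/* Use the mask to choose between two values without a branch. */
uint32_t select_u32(uint32_t mask, uint32_t if_set, uint32_t if_clear) {
    return (if_set & mask) | (if_clear & ~mask);
}

/* Branchless minimum built from the pieces above. */
int32_t min_branchless(int32_t a, int32_t b) {
    uint32_t mask = mask_from_01(slt01(a, b));
    return (int32_t)select_u32(mask, (uint32_t)a, (uint32_t)b);
}

/* Case 4: had SLT produced 0/-1, getting a canonical C bool needs an AND with 1. */
uint32_t canonical_bool(uint32_t mask) {
    return mask & 1u;
}
```

Either convention costs one extra instruction in one of the two cases, which is why the choice is hard to call a clear win.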
Interestingly, a time when you do want 0 or 1 is the examples in the original superoptimiser paper from 1987.
https://web.stanford.edu/class/cs343/resources/superoptimize...
They first considered the function:
int signum (int x) {
    if (x > 0) return 1;
    else if (x < 0) return -1;
    else return 0;
}
They showed the superoptimiser finding the following unexpected 68020 sequence, making use of the carry flag: (x in d0)
add.l d0,d0 ;add d0 to itself
subx.l d1,d1 ;subtract (d1 + Carry) from d1
negx.l d0 ;put (0 - d0 - Carry) into d0
addx.l d1,d1 ;add (d1 + Carry) to d1
(signum(x) in d1) (4 instructions)
This is much more straightforward on RISC-V: (x in a0)
slt a1,a0,zero # a1 = 1 if x is negative, 0 if 0 or positive
slt a0,zero,a0 # a0 = 1 if x is positive, 0 if 0 or negative
sub a0,a0,a1 # 1-0 = 1 if positive, 0-0 = 0 if zero, 0-1 = -1 if negative
-----
AIUI, if SLT returns 0 or -1 you can then reverse the arguments to SUB and get a correct result. If you return the result in a1 you can also keep the 2-operand compressed form of SUB, so there's effectively no difference. Equivalently, you can keep the SUB insn unchanged (thus using a 2-operand form to return in a0) while flipping the previous SLT insns: SLT a1, zero, a0; SLT a0, a0, zero.
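Both conventions can be checked side by side in C. This is just a sketch with made-up function names: `signum_01` follows the real SLT/SLT/SUB sequence, and `signum_mask` models a hypothetical SLT that returns 0 or -1, with the SUB operands reversed.

```c
#include <stdint.h>

/* Real RISC-V convention: SLT yields 0 or 1. */
int32_t signum_01(int32_t x) {
    int32_t neg = x < 0;    /* slt a1,a0,zero */
    int32_t pos = 0 < x;    /* slt a0,zero,a0 */
    return pos - neg;       /* sub a0,a0,a1 */
}

/* Hypothetical convention: SLT yields 0 or -1, SUB operands swapped. */
int32_t signum_mask(int32_t x) {
    int32_t neg = -(int32_t)(x < 0);   /* 0 or -1 */
    int32_t pos = -(int32_t)(0 < x);   /* 0 or -1 */
    return neg - pos;                  /* -1 - 0 = -1, 0 - 0 = 0, 0 - (-1) = 1 */
}
```

Both versions are three operations with no branches, which matches the claim that the choice makes no difference for this particular function.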
The very late addition of the reified B extensions (and others) will be a continuing problem, as builds will not be able to count on them having been implemented. (Trap emulation would be much worse than useless.) The lack of rotate operations in the base instruction set is a problem for implementing modern encryption systems. On embedded chips likely to appear in routers and switches, "extensions" such as the Bs are especially likely to be omitted.
It would not be necessary to abandon the work on RISC-V to do a Risc-6. Most of the work done could carry over.
This doesn't apply to floating point, where condition flags (confusingly called exceptions in IEEE parlance) are mandated by IEEE 754 and, AFAIK, RISC-V implements them conformantly. I don't see why they couldn't also do something like that for integers.
Modern x86_64 OSes such as Windows and Linux run on everything back to the original Opteron and Athlon 64 from 2003, which don't have POPCNT and LZCNT. Those were implemented by AMD starting with Bobcat and Bulldozer in 2011. Intel added POPCNT in Nehalem in 2008 and LZCNT in Haswell in 2013.
Aarch64 got both from the start, but there are other things added in ARMv8.1-A through ARMv8.8-A (and ARMv9) which are presumably also useful to certain software.
Embedded chips used in routers and switches will take exactly the extensions useful to them and none that aren't. If Zbb is useful to them then they will certainly include it -- that's why the extensions are specified so finely with three non-overlapping extensions for BitManip being defined this year. Applications processors running shrink-wrapped OSes are required to take all the extensions in RVA22 (or none). The embedded world picks and chooses what they want.
Chips used in routers and switches will be exactly what is cheapest, just as now, regardless of what performs best or adequately. Thus, they will lack B extensions, howsoever useful they might have been.
That's way overstated. RISC-V is still an amazingly clean and elegant design, placing extreme focus on technical excellence and on making effective use of limited insn encoding space. (Just look at how cautious the ratification of B and V has been - some of that was due to wanting to maximize feasible overlap between B and other exts, so as to avoid wasting even the smallest fractions of insn space). Tiny warts like SLT returning 0/1 as opposed to 0/-1 don't change that in any way.
"Tiny warts" reveal mindset: how aware are the designers of the consequences of their choices? Each is a clue. Lack of rotate and popcount instructions in the core instruction set provides a clue. Expectation that five-instruction sequences can be fused might be another. (When your instructions are already 4 bytes or more, each, five means at least 20 bytes for a single primitive operation.) The extremely complicated extensions landscape is another.
You are confusing embedded applications, which have huge flexibility with RISC-V, and standard operating systems with packaged software.
For the next few years (5?) standard operating systems have to support exactly two choices:
- RV64GC
- RVA22
RVA22 includes all the bit manipulation instructions, vectors, cache management, scalar crypto, and some other stuff. You can't pick and choose -- you have to support it all.
If you are making an embedded appliance on the other hand you can pick and choose exactly what extensions you need (a huge number of combinations, as you say), specify a core with exactly those extensions, build a chip around that with the other IP blocks you need, and tell your compiler which extensions you have. You compile all your software yourself, whether bare metal, using an RTOS, or a minimal Linux such as Buildroot or Yocto. There is zero confusion because you know what you have and you have what you need -- no more and no less.
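In source code, "telling your compiler which extensions you have" usually ends up looking something like the sketch below. It assumes the toolchain defines the `__riscv_zbb` test macro when Zbb is enabled (my understanding of what GCC and Clang do, but check your own toolchain) and that `__builtin_popcountll` is available; the function names are made up.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
unsigned popcount64_soft(uint64_t x) {
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}

unsigned popcount64(uint64_t x) {
#if defined(__riscv_zbb) && defined(__GNUC__)
    /* Assumption: with Zbb enabled the compiler can lower this builtin to cpop. */
    return (unsigned)__builtin_popcountll(x);
#else
    /* No Zbb (or a different compiler): use the software loop. */
    return popcount64_soft(x);
#endif
}
```

On an embedded build where you control the core, only one of the two paths ever exists in the binary, so there is no runtime detection or trap emulation involved.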
No one who knows what they are talking about is talking about fusing five-instruction sequences. That's a total red herring.
The assertion that rotate and popcount instructions are unimportant is false. All compilers peephole-optimize to generate rotate instructions where supported, and not because nobody needs that. There is a long history of mis-estimating instructions and their importance, going back to optimizing an instruction used only in a kernel idle loop.
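For anyone who hasn't seen it, the pattern compilers look for is roughly the following C idiom. The function name is made up, and whether a given compiler actually emits a single rotate for it depends on the target and optimization level.

```c
#include <stdint.h>

/* Rotate left by n bits, written so that no shift count reaches 32
   (which would be undefined behaviour in C). */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;                                    /* rotation is modulo the word size */
    return (x << n) | (x >> ((32 - n) & 31));   /* for n == 0 both halves equal x */
}
```

On a target without a rotate instruction the same source compiles to two shifts and an OR, plus whatever is needed to compute the second shift count.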
A more objective measure is to note how often a neglected instruction has needed to be added after the first ISA version shipped, because its lack handicapped the chips on the market. Popcount wins that race everywhere: always neglected, always added. Its neglect reveals the blinders of the CS academics who do the initial ISA designs, and the need to patch reveals the reality.
The importance of an instruction is poorly represented by both its static frequency and by its total execution frequency for the same reason as that idle-loop instruction was miscounted: the importance of lines of code varies by many orders of magnitude, and there is no way to measure importance when counting. It is easy to prioritize instructions used in signature benchmarks, but they are a cracked mirror.
The market is another cracked mirror: it takes a very large signal to penetrate it. Any that does merits attention.
The part of the embedded market that uses off the shelf chips may have more designs, but they are each low volume enough that they are dominated by engineering time (and thus employ more engineers, who are vocal online) not by saving a few cents on an MCU that doesn't have the instruction you want.
The packaging and testing and stocking costs of a chip with nothing more than a generic MCU inside it are such a high proportion of the cost compared to the actual die that it would be silly to leave any available low silicon cost extensions out, unless done for monopolistic market segmentation reasons e.g. you can't buy a Cortex M0+ with an FPU at any price because ARM would prefer to sell you an M4F for much more money.
The RISC-V market with many vendors with many cores is not prone to such artificial market segmentation.
If you want the equivalent of an M0 (short pipe, no cache, no branch prediction etc) but with an FPU or with 64 bits or with a vector unit then RISC-V vendors say "sure, no problem".
All of that is fine for somebody designing for a million-unit SSD, who doesn't need to read any of this, and for chip vendors selling to that person.
But for each such somebody, literally thousands are stuck with whatever chip purchasing says they can get cheap enough off the shelf. Those chips will be exactly the ones that somebody else ordered 100M of without considering for even a second the needs of the thousands of others whose experience they dictate.
And, it remains a fact that none of the RISC-V MCU chips I can buy off the shelf have any of the B extension instructions implemented.
As RISC-V extensions are developed using a cooperative process between domain experts at dozens of different companies and educational or scientific institutions, it is obviously impossible and unproductive to attempt to do this in secret.
You can't buy a chip with the B extension because the B extension isn't ratified yet so anyone who claimed to make a B-compatible chip would be taking a risk that the spec might change incompatibly before ratification. The spec was frozen in June and the 45 day Public Comment phase was held in June/July. As far as I know, no issues were raised. The extension (actually three of them, covering different areas) will be ratified before the end of the year, along with several others including V.
That's obviously a very short time scale compared to making a chip. Any chip you can buy now would have been taped out in 2019 or the first part of 2020.
RISC-V is very new. If you want to ignore it until 2023 then feel free. Others find it useful how it is now.