RISC-V Is Sloooow(marcin.juszkiewicz.com.pl) |
RISC-V Is Sloooow(marcin.juszkiewicz.com.pl) |
RISC-V will get there, eventually.
I remember that ARM started as a speed demon with conscious power consumption, then was surpassed by x86s and PPCs on desktops and moved to embedded, where it shone by being very frugal with power, only to now be leaving the embedded space with implementations optimised for speed more than power.
1) https://github.com/llvm/llvm-project/issues/150263
2) https://github.com/llvm/llvm-project/issues/141488
Another example is hard-coded 4 KiB page size which effectively kneecaps ISA when compared against ARM.
All of these extensions are mandatory in the RVA22 and RVA23 profiles and so will be implemented on any up to date RISC-V core. It's definitely worth setting your compiler target appropriately before making comparisons.
The problem is decades of software being written on a chip that from the outside appears not to care.
Page size can be easily extended down the line without breaking changes.
Huh? They have no idea what they are doing. If data is unaligned, the solution is memcpy, not compiler optimizations, also their hack of 17 loads is buffer overflow. Also not ISA spec problem.
Not trolling: I legitimately don't see why this is assumed to be true. It is one of those things that is true only once it has been achieved. Otherwise we would be able to create super high performance Sparc or SuperH processors, and we don't.
As you note, Arm once was fast, then slow, then fast. RISC-V has never actually been fast. It has enabled surprisingly good implementations by small numbers of people, but competing at the high end (mobile, desktop or server) it is not.
I'm a chip designer and I see people using RISC-V as small processor cores for things like PCIE link training or various bookkeeping tasks. These don't need to be fast, they need to be small and low power which means they will be relatively slow.
Most people on tech review sites only care about desktop / laptop / server performance. They may know about some of the ARM Cortex A series CPUs that have MMUs and can run desktop or smartphone Linux versions.
They generally don't care about the ARM Cortex M or R versions for embedded and real time use. Those are the areas where you don't need high performance and where RISC-V is already replacing ARM.
EDIT:
I'll add that there are companies that COULD make a fast RISC-V implementation.
Intel, AMD, Apple, Qualcomm, or Nvidia could redirect their existing teams to design a high performance RISC-V CPU. But why should they? They are heavily invested in their existing x86 and ARM CPU lines. Amazon and Google are using licensed ARM cores in their server CPUs.
What is the incentive for any of them to make a high performance RISC-V CPU? The only reason I can think of is that Softbank keeps raising ARM licensing costs and it gets high enough that it is more profitable to hire a team and design your own RISC-V CPU.
The most promising RISC-V companies today have not set out to compete directly with Intel, AMD, Apple or Samsung, but are targeting a niche such as AI, HPC and/or high-end embedded such as automotive.
And you can bet that Qualcomm has RISC-V designs in-house, but only making ARM chips right now because ARM is where the market for smartphone and desktop SoCs is. Once Google starts allowing RVA23 on Android / ChromeOS, the flood gates will open.
However, it takes time from microarchitecture to chips, and from chips to products on shelves.
The very first RVA23-compatible chips to show up will likely be the spacemiT K3 SoC, due in development boards April (i.e. next month).
More of them, more performant, such as a development board with the Tenstorrent Ascalon CPU in the form of the Atlantis SoC, which was tapped out recently, are coming this summer.
It is even possible such designs will show up in products aimed at the general public within the present year.
That's true, but tautological.
The issue is that the RISC-V core is the easy part of the problem, and nobody seems to even be able to generate a chip that gets that right without weirdness and quirks.
The more fundamental technical problem is that things like the cache organization and DDR interface and PCI interface and ... cannot just be synthesized. They require analog/RF VLSI designers doing things like clock forwarding and signal integrity analysis. If you get them wrong, your performance tanks, and, so far, everybody has gotten them wrong in various ways.
The business problem is the fact that everybody wants to be the "performance" RISC-V vendor, but nobody wants to be the "embedded" RISC-V vendor. This is a problem because practically anybody who is willing to cough up for a "performance" processor is almost completely insensitive to any cost premium that ARM demands. The embedded space is hugely sensitive to cost, but nobody is willing to step into it because that requires that you do icky ecosystem things like marketing, software, debugging tools, inventory distribution, etc.
This leads to the US business problem which is the fact that everybody wants to be an IP vendor and nobody wants to ship a damn chip. Consequently, if I want actual RISC-V hardware, I'm stuck dealing with Chinese vendors of various levels of dodginess.
A lot of times the path to the highest performing CPU seems to be to optimize for power first, then speed, then repeat. That's because power and heat are a major design constraint that limits speed.
I first noticed this way back with the Pentium 4 "Netburst" architecture vs. the smaller x86 cores that became the ancestor of the Core architecture. Intel eventually ran into a wall with P4 and then branched high performance cores off those lower-power ones and that's what gave us the venerable Core architecture that made Intel the dominant CPU maker for over a decade.
ARM's history is another example.
In comparison, I think Arm is actually a very strong cautionary tale that focusing on power will not get you to performance. Arm processors remained pretty poor performance until designers from other CPU families entirely (PowerPC and Intel) took it on at Apple and basically dragged Arm to the performance level they are today.
https://stackoverflow.com/questions/45066299/was-there-a-p4-...
Banias was hyper optimized for power, the mantra was to get done quickly and go to sleep to save power. Somewhere along the line someone said "hey what happens if we don't go to sleep?" and Core was born.
Is this because Fedora 44 is going to beta?
The optimizations that'd be applied to ARM and MIPS would be equally applicable to RISC-V. I do not believe this is a lack of software optimization issue.
We are well past the days where hand written assembly gives much benefit, and modern compilers like gcc and llvm do nearly identical work right up until it comes to instruction emissions (including determining where SIMD instructions could be placed).
Unless these chips have very very weird performance characteristics (like the weirdness around x86's lea instruction being used for arithmetic) there's just not going to be a lot of missed heuristics.
There's no carry bit, and no widening multiply(or MAC)
BTW, it's quite impressive how the s390x is so fast per core compared to the others. I mean, of course it's fast - we all knew that.
And don't let IBM legal see this can be considered a published benchmark, because they are very shy about s390x performance numbers.
At this point the most likely place for truly competitive RISC-V to appear is China.
That's what makes writing specs hard - you need people who understand implementation challenges at the table, not dreaming architects and academics.
In theory you can spend a lot of effort to make a flawed ISA perform, but it will be neither easy nor pretty e.g. real world Linux distros can't distribute optimised packages for every uarch from dual-issue in-order RV64GC to 8-wide OoO RV64 with all the bells and whistles. Only in (deeply) embedded systems can you retarget the toolchain and optimise for each damn architecture subset you encounter.
Over a decade ago: https://news.ycombinator.com/item?id=8235120
RISC-V will get there, eventually.
Strong doubt. Those of us who were around in the 90s might remember how much hype there was with MIPS.
It was not designed to be one, but it ended up being surprisingly fast.
First, we do have a recent 'binutils' build[1] with test-suites in 67 minutes (it was on Milk-V "Megrez") in the Fedora RISC-V build system. This is a non-trivial improvement over the 143-minute build time reported in the blog.
Second, the current fastest development machine is not Banana Pi BPI-F3. If we consider what is reasonably accessible today, it is SiFive "HiFive P550" (P550 for short) and an upcoming UltraRISC "DP1000", we have access to an eval board. And as noted elsewhere in this thread, in "several months" some RVA23-based machines should be available. (RVA23 == the latest ISA spec).
FWIW, our FOSDEM talk from earlier this year, "Fedora on RISC-V: state of the arch"[1], gives an overview of the hardware situation. It also has a couple of related poorman's benchmarks (an 'xz' compression test and a 'binutils' build without the test-suite on the above two boards -- that's what I could manage with the time I had).
Edit: Marcin's RISC-V test was done on StarFive "Vision Five 2". This small board has its strengths (upstreamed drivers), but it is not known for its speed!
[1] https://riscv-koji.fedoraproject.org/koji/taskinfo?taskID=91...
[2] Slides: https://fosdem.org/2026/events/attachments/SQGLW7-fedora-on-...
Assuming they will keep their word, later this year Tenstorrent is supposed to ship their RVA23-based server development platform[1]. They announced[2] it at the last year's NA RISC-V Summit. Let's see.
The ball is in the court of hardware vendors to cook some high-end silicon.
[1] https://tenstorrent.com/ip/risc-v-cpu
[2] https://static.sched.com/hosted_files/riscvsummit2025/e2/Unl...
I think the ban of SOPHGO is part to blame for the slow development.[2] They had the most performant and interesting SOCs. I had a bunch of pre-orders for the Milk-V Oasis before it was cancelled. It was supposed to come out a while ago, using the SG2380, supposedly much more performant than the Milk-V Titan mentioned in the article (which still isn't out).
It was also SOPHGO's SOCs that powered the crazy cheap/performant/versatile Milk-V DUO boards. They have the ability to switch ARM/RISC-V architecture.
[1]: https://archriscv.felixc.at/
[2]: https://www.tomshardware.com/tech-industry/artificial-intell...
What I do know is since the ban, all ongoing products featuring SOPHGO SOCs were cancelled, and I haven't seen any products featuring them since. The SOPHGO forums have also closed down.
The Milk-V Oasis would have had 16 cores (SG2380 w/ SiFive P670), it was replaced by the Milk-V Megrez with just 4 cores (SiFive P550) for around the same price. The new Milk-V Titan has only 8. We're slowly catching up, but the performance is now one or two years behind what it could've been.
The SG2380 would've been the first desktop ready RISC-V SOC at an affordable price. I think it's still the only SOC made that used the SiFive P670 core.
That's not how it usually works :\
RISC-V is certainly spreading across niches, but performant computing is not one of them.
Edit: lol the author mentions the same! Perhaps they were the source of the original Mastodon post I'm thinking of.
Obviously a solvable problem to split build and test but perhaps the time savings aren't worth the complexity.
https://src.fedoraproject.org/rpms/rpy/pull-request/4#reques...
Your question made me look up Arm's history in Fedora and came up on this 2012 LWN thread[1]. There's some discussion against cross-compilation already back then.
The fact that i686 is 14% faster than x86_64 is a little suspicious, because usually the same software runs _faster_ on x86_64 (despite the increased memory use) thanks to a larger register set, an optimized ABI, and more vector instructions.
Of course, if you are compiling an i686 binary on i686, and an x86_64 binary on x86_64, then the compilers aren't really doing the same work, since their output is different. I'm not a compiler expert, but I could imagine that compiling x86_64 binaries is intrinsically slower than for i686 for a variety of reasons. For example, x86_64 is mostly a superset of i686, so a compiler has way more instructions to consider, including potential optimizations using e.g. SIMD instructions that don't exist on i686 at all. Or a compiler might assume a larger instruction cache size, by default, and do more unrolling or inlining when compiling for x86_64. And so on.
In that case, compiling on x86_64 is slower not because the hardware is bad but because the compiler does more work. Perhaps something similar is happening on RISC-V.
Which boards are used specifically should not matter much. There's not much available.
Except for the Milk-V Pioneer, which has 64 cores and 128GB ram. But that's an older architecture and it's expensive.
That is the same as the time he quotes for the unidentified Aarch64 hardware.
Which makes this a pretty funny article.
I do not have a K3 to confrim. I am hoping to pick one up when it becomes more widely available next month.
Old news. See also:
> Random mumblings of x86_64 developer ... ARM is sloooow
On a related note, SoC companies needs to get their act together and start using the latest arm cores. Even the mid range cores of 1-2 years ago show a huge leap in performance:
https://sbc.compare/56-raspberry-pi-500-plus-16gb/101-radxa-...
I think that's the point being made here. ARM in the 2000s was not known to be fast, now it is.
RISC-V being slow isn't an inherent characteristic of the ISA, it only tells you about the quality of its implementations. And said implementations will only improve if corporations are throwing capitals at it (see: Apple, Qualcomm, etc.)
i. llvm presentation can thrash caches if setup wrong (given the plethora of RISC-V fragmented versions, most compilers won't cover every vanity silicon.)
ii. gcc is also "slow" in general, but is predictable/reliable
iii. emulation is always slower than kvm in qemu
It may seem silly, but I'd try a gcc build with -O0 flag, and a toy unit test with -S to see if the ASM is actually foobar. One may have to force the -mtune=boom flag to narrow your search. Best regards =3
> RISC-V builders have four or eight cores with 8, 16 or 32 GB of RAM (depending on a board)
> The UltraRISC UR-DP1000 SoC, present on the Milk-V Titan motherboard should improve situation a bit (and can have 64 GB ram).
RISC-V SOCs just typically don't support much ram. With the exception of the SG2042 which can take 128GB, but it's expensive, buggy and now old.
So I am sure it's a combination of low ram and low clockspeeds.
The reason that he does not tell us what hardware he is using is because none of these times are for a single system building binutils. I think he is using a mix of systems and then doing some kind of averaging to tell us what a individual system would look like.
For some kind of hardware, all the systems they have would be the fastest that architecture offers, like with i686 I expect. While others are going to be a mix of old and new, like x86-64.
For RISC-V, the latest gen hardware is about as fast as the numbers he quotes for Aarch64. To be clear, the fastest ARM is still faster than the fastest RISC-V. But the numbers he quotes make no sense for something like a SpacemiT K3.
But if you are using RISC-V systems from two years ago in your build cluster, they will as he says be "Sloooow". But that shows how fast RISC-V is improving. It makes no sense to publish this article now.
At least, he should reveal what hardware he is talking about. His chart makes no sense (for most of the platforms).
- mentioned which board had 143 minutes, added info about time on Milk-V Megrez board
- added section 'what we need hw-wise for being in fedora'
- added link to my desktop post to point that it is aarch64, not x86-64
- wording around qemu to show that I use it locally only
That's....slow. What a huge pile of bloat.
Question: While you would want any official arch built natively, maybe an interim stage of emulated vm builds for wip/development/unsupported architectures would still be preferable in this case?
Comparing the tradeoffs: * Packages disabled and not built because of long build times. * Packages built and automated tests run on inaccurately emulated vms (NOT cross compiled). Users can test. It might be broken.
It's an experimental arch, maybe the build cluster could be experimental too?
It is not the ISA, but the implementations and those horrible SDKs which needs to be adjusted for RISC-V (actually any new ISA).
RISC-V needs extremely performant implementations, that on the best silicon process, until then RISC-V _will be_ "slow".
Not to mention, RISC-V is 'standard ISA': assembly writted software is more than appropriate in many cases.
I setup a CopyParty server on a headless RISC-V SBC and was a breeze. Just get the packets, do the thing, move on. Obviously depends on your need but maybe you're not using the right workflow and blame the tools instead.
Anyway, it's hardly surprising that a young ISA with not a 1/1000th of the investment of x86 or ARM has slower chips than them x)
Build time on target hardware matters when you're re-building an entire Linux distribution (25000+ packages) every six months.
Interesting that it's mandated as native - i'm really not sure the logic behind this (i've worked in the embedded world where such stuff is not only normal, but the only choice). I'll do some digging and see if I can find the thought process behind this.
I guess that may be the true use case for 'Open-Source' cores.
That being said, the advertised SPEC2007 scores are close to a M1 in IPC.
Cortex A57 is 14 years old and is significantly faster than the 9 year old Cortex A55 these RISC-V cores are being compared against.
So yes it's many years behind. Many, many years.
Tenstorrent Atlantis (first Ascalon silicon) should ship in Q2/Q3 and be twice as fast. About as fast as Ryzen5. So, about 5 years behind AMD.
But even the K3 has faster AI than Apple Silicon or Qualcomm X Elite.
Current trend-lines suggest ARM64 and RISC-V performance parity before 2030.
It just takes time, people who believe in it and tons of money. Will see where the journey goes, but I am a big risc-v believer
It's a good solid reliable board, but over three years old at this point (in a fast-moving industry) and the maximum 8 GB RAM is quite challenging for some builds.
Binutils is fine, but on recent versions of gcc it wants to link four binaries at the same time, with each link using 4 GB RAM. I've found this fails on my 16 GB P550 Megrez with swap disabled, but works quickly and uses maybe 50 or 100 MB of swap if I enable it.
On the VisionFive 2 you'd need to use `-j1` (or `-j2` with swap enabled) which will nearly double or quadruple the build time.
Or use a better linker than `ld`.
At least the LLVM build system lets you set the number of parallel link jobs separately to the number of C/C++ jobs.
I see, I don't have a Megrez at my desk, only in the build system. I only have P550 as my "workhorse".
PS: I made a typo above - the P550 I was referring to was the SiFive "HiFive Premier P550". But based on your HN profile text, you must've guessed it as much :)
Fedora does not have a way to cross compile packages. The only cross compiler available in repositories is bare-metal one. You can use it to build firmware (EDK2, U-Boot) or Linux kernel. But nothing more.
Then there is the other problem: testing. What is a point of successful build if it does not work on target systems? Part of each Fedora build is running testsuite (if packaged software has any). You should not run it in QEMU so each cross-build would need to connect to target system, upload build artifacts and run tests. Overcomplicated.
Native builds allows to test is distribution ready for any kind of use. I use AArch64 desktop daily for almost a year now. But it is not "4core/16GB ram SBC" but rather "server-as-a-desktop" kind (80 cores, 128 GB ram, plenty of PCI-Express lanes). And I build software on, write blog posts, watch movies etc. And can emulate other Fedora architectures to do test builds.
Hardware architecture slow today, can be fast in the future. In 2013 building Qt4 for Fedora/AArch64 took days (we used software emulators). Now it takes 18 minutes.
You can solve this on a per language basis, but the C/C++ ecosystem is messy. So people use VMs or real hardware of the target arch to not have to think about it
With LLVM existing, cross-compiling is not a problem anymore, but it means you can't run tests without an emulator. So it might just be easier to do it all on the target machine.
There are lots of small issues (libraries or headers not being found, wrong libraries or headers being found, build scripts trying to run the binaries they just built, wrong compiler being used, wrong flags being used, etc.) when trying to cross-compile arbitrary software.
All fixable (cross-compiling entire distributions is a thing), but a lot of work and an extra maintenance burden.
Its a bootstrapping chain of priority. Once a native build regime is set in stone, cross compiling harnesses can be built to exploit the beachhead.
I have saved many a failing projects budget and deadline by just putting the compiler onboard and obviating the hacky scaffolding usually required for reliable cross compiling at the beginning stages of a new architecture project, and I suspect this is the case here too ..
Hugely underappreciated. Someone involved fully understood that "you don't get to the moon by climbing progressively taller trees".
The other two times Arm had great performance were the StrongArm, when it was implemented by DEC people off the Alpha project, and the initial ones, which were quite esoteric and unusually suited to the situation of the late 80s.
Integer MAC doesn't exist, and is also hindered by a design decision not to require more than two source operands, so as to allow simple implementations to stay simple. The same reason also prevents RISC-V from having a true conditional move instruction: there is one but the second operand is hard-coded zero.
FMAC exists, but only because it is in the IEEE 754 spec ... and it requires significant op-code space.
Is it good enough answer?
That'd be ~7 years behind, not 4. Cortex A76 came out in late 2018. Also what benchmarks are you looking at?
> Tenstorrent Atlantis (first Ascalon silicon) should ship in Q2/Q3 and be twice as fast. About as fast as Ryzen5. So, about 5 years behind AMD.
Which Ryzen 5? The first Ryzen 5 came out in 2017, which was a lot more than 5 years ago.
> But even the K3 has faster AI than Apple Silicon or Qualcomm X Elite.
Which isn't RISC-V. Might as well brag about a RISC-V CPU with an RTX 5090 being faster at CUDA than a Nintendo Switch. That's a coprocessor that has nothing to do with the ISA or CPU core.
> Current trend-lines suggest ARM64 and RISC-V performance parity before 2030.
L. O. fucking. L. That's not how this works. That's not how any of this works.
You think the Alibaba C930 CPU is for embedded? 15 SPECint2006 / GHz
Or that the Tenstorrent Ascaclon will be? 18 SPECint2006 / GHz
Even the SpacemiT K3 has better AI performance than an Apple Silicon M4.
And RISC-V chips released this year are 2-4 times faster than last year. RISC-V is not the fastest ISA but it is improving the fastest.
With so many companies backing RISC-V, why would I bet against it?
But if you look at the Intel x86 smartphone chips from about 10 years ago they had to make an ARM to x86 emulator because even the Java apps contained native ARM instructions for performance reasons.
Qualcomm is trying to push their ARM Snapdragon chips in Windows laptops but I don't think they are selling well.
Nvidia could also make RISC-V based chips but where would they go? Nvidia is moving further away from the consumer space to the data center space. So even if Nvidia made a really fast RISC-V CPU it would probably be for the server / data center market and they may not even sell it to ordinary consumers.
Or if they did it could be like the Ampere ARM chips for servers. Yeah you can buy one as an ordinary consumer but they were in the $4,000 range last time I looked. How many people are going to buy that?
That definitely seems to be the case. I think they likely would have more luck with Riscv phones (much less app brand loyalty). or servers (arm in the server has done a lot better than on windows)
For Nvidia, if they made a consumer riscv cpu it would be a gaming handheld/console (Switch 3 or similar) once the AI bubble pops. Before that, likely would be server cpus that cost $10k for big AI systems. Before that, I could see them expanding the role of Riscv in their GPUs (likely not visible to to users).
Honestly, the initial reaction is it sounds like cope, and I know this because I've been saying it for ages to angry reactions. RISC-V looks for all the world like it is designed for competing with the 32 bit Arm ecosystem but that the designers didn't, and still don't, understand what 64 bit Arm is about.
Secondly, it's been necessary to claim such things are forever on the way in order to maintain hype and get software support. Without it you wouldn't see nearly so much Linux buildchain work. (See the open source SuperH implementations for what happens if you admit you don't go for high performance).
Finally though, as process nodes get smaller you can afford to put much more complex blocks in the same area, which can then burst through a series of operations and power off again, many times a second. (Edit to add: of course you know that, but it's still counter intuitive the extent to which it changes things over time. People have things like floating point support in places that not too long ago would have been completely minimalist, and there are some really extreme examples around).
> I'll add that there are companies that COULD make a fast RISC-V implementation.
Again, there is no proof of this until it actually happens. When Qualcomm were trying they wanted to change the spec of RISC-V, and I strongly suspect that is actually necessary.
But if you’re judging an ISA by performance scalability, you generally want to look at single-threaded performance.
I personally found cross compiling Rust easy, as long as you don't have C dependencies. If you have C dependencies it becomes way harder.
This suggests that spending time to upstream cross compilation fixes would be worth it for everyone, and probably even in the C world, 20% of the packages need 80% of the effort.
Why? Because they want something like a $300 CPU and $150 motherboard using standard DDR4/5 DIMMs that is RISC-V or ARM or something not x86 but is faster than x86. The sub $1000 systems that hardware companies make that are RISC-V or ARM chips are low end embedded single board systems that are too slow for these people. The really fast systems are $4000 server level chips that they can't afford. The only company really bringing fast non-x86 CPUs with consumer level pricing is Apple. We can also include Qualcomm but I'm skeptical of the software infrastructure and compatibility since they are relying on x86 emulation for windows.
BTW. Keller is also on the board of AheadComputing — founded by former Intel engineers behind the fabled "Royal Core".
That's useable for many applications, but it's not going to change the world. A lot of "micro PCs" with low power CPUs are well past that now. If that's what Ascalon turns out to be, it will amount to an SBC class device.
The Raspberry Pi 5 results on Geekbench 6 are all over the place. A score between 500 to 900 in single core and a 2000 multi core score.
Radxa 4 is an SBC based around the N100 and it basically gets the same or slightly higher performance as the Raspberry Pi 5.
Meanwhile the i5-9600K gets a score of 1677 in single core, which is 83% of the performance of the entire Raspberry Pi 5 and gets a score of 6199 when using multiple cores, that's 3x the performance.
I'd call this at least "Laptop class" and you even admitted yourself back in 2025 that you're using a processor on that level.
Supposedly happened earlier this year. Tenstorrent says devboards in Q3.
Now we just wait.
Or we just adopt Loongson.
v6-M doesn't (e.g. Cortex-M0+). v7-M and v8-M do allow unaligned access on Normal memory but not on Device memory.
This is the classic conundrum of legacy system redesign - if customers keep demanding every feature of the old system be present, and work the exact same then the new system will take on the baggage it was designed to get rid of.
The new implementation will be slow and buggy by this standard and nobody will use it.
If the CPU doesn't do it software must make many tiny conditional copies which is bad for branch prediction.
This sucks double when you have variable length vector operations... IMO fast unaligned memory accesses should have been mandatory without exceptions for all application-level profiles and everything with vector.
I'm not familiar with RISC-V but from what I've seen here, they're also trying to solve this similarly with vector or bit extraction instructions.
Neither are RISC nor modern.
I have only seen PDP-11 Assembly snippets in UNIX related books, wasn't aware of its alignment requirements.
What is the current fastest platform that isn’t exorbitantly expensive? Not upcoming releases, but something I can actually buy.
I check in every 3-6 months but the situation hasn’t changed significantly yet.
The cores are in my experience moderately fast at most. Note that there are a lot of licencing options and I think some are speed-capped - but I don't think that applies to IFL - a standard CPU licence-restricted to only run linux.
Ironically, its SoC (spacemiT K1) is slower than the JH7110 used in the first mass-produced RISC-V SBC, VisionFive 2.
But unlike JH7110, it has vector 1.0, making it a very popular target.
Of course, none of these pre-RVA23 boards will be relevant anymore, once the first development boards with RVA23-compatible K3 ship next month.
These are also much faster than anything RISC-V currently purchasable. Developers have been playing with them for months through ssh access.
SpacemiT K3 is 2010 Macbook performance single-core, 2019 Macbook Air multi-core, and better than M4 Apple Silicon for AI.
So I guess it depends on what you are going to do with it.
E.g. M4 total system memory bandwidth is 120GB/s whereas K4 is 51GB/s, single core memory bandwidth is 100-120GB/s vs ~30GB/s. M4 has 10 CPU cores and neural engine with 16 cores whereas K3 has 8 CPU cores and 8 "AI" cores, K3 clock frequency is almost half the clock frequency in M4 etc. etc.
But anyway thanks for sharing, always good to learn about new hardware.
This is primarily because core is primarily a teaching ISA. One of the best parts about RiscV is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used. Anything powerful to run (a non built from source manually) linux will support a profile that bundles all the commonly needed instructions to be fast.
https://five-embeddev.com/riscv-bitmanip/1.0.0/bitmanip.html
I can see quite a few items on that list that imnsho should have been included in the core and for the life of me I can't see the rationale behind leaving them out. Even the most basic 8 bit CPU had various shifts and rolls baked in.
Same could be said of MIPS.
My understanding is the RISC-V raison d'etre is rather avoidance of patented/copywritten designs.
In spite of the currently mediocre RISC-V implementations, RISC-V seems to have more of a future and isn't clouded by ISA IP issues, as you note.
That doesn't necessarily make it all that great for industrial use, does it?
> One of the best parts about RiscV is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used.
You can also do that with Intel MCS-51 (aka 8051) or even i960. And again, having an ISA easily implementable "on a knee" by a fresh graduate doesn't says anything about its other technical merits other than being "easily implementable (when done in the most primitive way possible)".
Why did it fall to them to do it? Impressive that he did, but it shouldn't have been necessary.
https://wren.wtf/hazard3/doc/#extension-xh3bextm-section
There are also four other custom extensions implemented.
My bubble includes a number of SBCs and embedded boards from Advantech, frequently using Ryzen embedded (V1000 class) CPUs.
SBC is too vague I suppose. Past the Raspberry Pi form factor SBC class, there are many* SBC vendors with Core i5-1340P and similar CPUs today. That's a 2023 device, and just past a 2018 i5-9600K, aligning well with what I claimed.
In 2025+, such a CPU is not a desktop class device, and is sufficient only in low cost laptops (but in much lower power form.) A MacBook Neo A18, for example, is considerably better than a i5-9600K.
It would be great if Tentorrent actually yields such a product, and if, based on later performance projections that appeared in late 2025, Ascalon is actually faster, but, as I said, the world will not change much. RISC-V developers will appreciate compiling like it's 2019, but that's as far as it will go.
* LattePanda Sigma, ASROCK NUC, DFROBOT, Premio and many NAS and industrial devices.
AVX shift and shuffle is mostly limited to 128 bits unfortunately for historical reasons (even for 256-bit instructions) and hardware support for AVX512/AVX10 where they fixed that is a complete mess so it's hard to rely on when you care about backwards compatibility for consumer devices, e.g. in game development.
RISC-V vector has excellent mask/shuffle/permute but the performance in real silicon can be... questionable. See the timings for vrgather here for example: https://camel-cdr.github.io/rvv-bench-results/spacemit_a100/...
For working with packed data structures where fields are irregular/non-predictable/dependent on previous fields etc. unaligned load/store is a godsend. Last time I worked on a custom DB engine that used these patterns the generated x86 code was so much nicer than the one for our embedded ARM cores.
It is quite likely that not allowing the misaligned access was also influenced by PDP-11.
Regarding silicon implementations, consider that 1) you can synthesize it from HDL/RTL designs using modern CAD tools, and 2) MIPS was originally designed to be simple enough for grad students to implement with the primitive CAD tools of the 1980s (basically semi-manual layout).
> In the current version of this architecture specification, TLB refill and consistent maintenance between TLB and page tables are still [sic] all led by software.
https://loongson.github.io/LoongArch-Documentation/LoongArch...
Especially the lack of integer overflow detection is a choice of great stupidity, for which there exists no excuse.
Detecting integer overflow in hardware is extremely cheap, its cost is absolutely negligible. On the other hand, detecting integer overflow in software is extremely expensive, increasing both the program size and the execution time considerably, because each arithmetic operation must be replaced by multiple operations.
Because of the unacceptable cost, normal RISC-V programs choose to ignore the risk of overflows, which makes them unreliable.
The highest performance implementations of RISC-V from previous years were forced to introduce custom extensions for indexed addressing, but those used inefficient encodings, because something like indexed addressing must be in the base ISA, not in an extension.
Since my previous attempt to measure the impact of trap on signed overflow didn't seem to have moved your position one bit, I thought I'd give it a go in the most representable way I could think of:
I build the same version of clang on a x86, aarch64 and RISC-V system using clang. Then I build another version with the `-ftrapv` flag enabled and compared the compiletimes of compiling programs using these clang builds running on real hardware:
runtime: x86 | aarch64 | RISC-V (RVA23)
Zen1 | A78 A55* | X100 A100 !!! all cores clocked to about 2.2GHz, Zen1 can reach almost 4GHz
clang A: 3.609±0.078 | 4.209±0.050 9.390±0.029 | 5.465±0.070 11.559±0.020
clang-ftrapv A: 3.613±0.118 | 4.290±0.050 9.418±0.056 | 5.448±0.060 11.579±0.030
clang B: 8.948±0.100 | 10.983±0.188 22.827±0.016 | 13.556±0.016 28.682±0.023
clang-ftrapv B: 8.960±0.125 | 11.099±0.294 22.802±0.039 | 13.511±0.018 28.741±0.050
As you can see, once again the overhead of -ftrapv is quite low.Suprizinglt the -ftrapv overhead seems the highest on the Cortex-A78. My guess is that this because clang generates a seperate brk with unique immediate for every overflow check, while on RISC-V it always branches to one unimp per function.
Please tell me if you have a better suggestion for measuring the real world impact.
Or heck, give me some artificial worst case code. That would also be an interesting data point.
Notes:
* The format is mean±variance
* Spacemit X100 is a Cortex-A76 like OoO RISC-V core and A100 an in-order RISC-V core.
* I tried to clock all of the cores to the same frequency of about 2.2GHz. *Except for the A55, which ran at 1.8GHz, but I linearly scaled the results.
* Program A was the chibicc (8K loc) compiler and program B microjs (30K loc).
binary size:
x86 aarch64 RISC-V
clang: 212807768 216633784 195231816
clang-ftrapv: 212859280 216737608 195419512
increase: 0.24% 0.047% 0.09%Most languages don't care about integer overflow. Your typical C program will happily wrap around.
If I really want to detect overflow, I can do this:
add t0, a0, a1
blt t0, a0, overflow
Which is one more instruction, which is not great, not terrible.And what did I find? Yep that code is right from the manual for unsigned integer overflow.
For signed addition if you know one of the signs (eg it’s a compile time constant) the manual says
addi t0, t1, +imm
blt t0, t1, overflow
But the general case for signed addition if you need to check for overflow and don’t have knowledge of the signs add t0, t1, t2
slti t3, t2, 0
slt t4, t0, t1
bne t3, t4, overflow
From what I’ve read most native compiled code doesn’t really check for overflows in optimised builds, but this is more of an issue for JavaScript et al where they may detect the overflow and switch the underlying type? I’m definitely no expert on this.The correct sequence of instructions is given in the RISC-V documentation and it needs more instructions.
"Integer overflow" means "overflow in operations with signed integers". It does not mean "overflow in operations with non-negative integers". The latter is normally referred as "carry".
The 2 instructions given above detect carry, not overflow.
Carry is needed for multi-word operations, and these are also painful on RISC-V, but overflow detection is required much more frequently, i.e. it is needed at any arithmetic operation, unless it can be proven by static program analysis that overflow is impossible at that operation.
It's easy to believe you're replying to something that has an element of hyperbole.
It's hard to believe "just do 2x as many instructions" and "ehhh who cares [i.e. your typical C program doesn't check for overflow]", coupled to a seemingly self-conscious repetition of a quip from the television series Chernobyl that is meant to reference sticking your head in the sand, retire the issue from discussion.
this just isn't true. both addition and multiplication can check for overflow in <2 instructions.
I know this is a very negative take. I don't try to hide my pro-Power ISA bias, but that doesn't mean I wouldn't like another choice. So far, however, I've been repeatedly disappointed by RISC-V. It's always "five or six years" from getting there.
SPARC (formerly called Berkeley RISC) and MIPS were pioneers that experimented with various features or lack of features, but they were inferior from many points of view to the earlier IBM 801.
The RISC ISAs developed later, including ARM, HP PA-RISC and IBM POWER, have avoided some of the mistakes of SPARC and MIPS, while also taking some features from IBM 801 (e.g. its addressing modes), so they were better.
The x86-64 is a dog's breakfast of features. But due to its widespread use, compiler writers make the effort to create compilers that optimize for its quirks.
Itanium hardware designers were expecting the compiler writers to cater for its unique design. Intel is a semi company. As good as some of their compilers are, internally they invested more in their biggest seller and the Itanium never got the level of support that was anticipated at the outset.
You're saying ISA design does have implementation performance implications then? ;)
> There's no one that expects it'll be hard to optimize for
[Raises hand]
> There are at least 2 designs that have taped out in small runs and have high end performance.
Are these public?
Edit: I should add, I'm well aware of the cultural mismatch between HN and the semi industry, and have been caught in it more than a few times, but I also know the semi industry well enough to not trust anything they say. (Everything from well meaning but optimistic through to outright malicious depending on the company).
> There's no one that expects it'll be hard to optimize for
No one who is an expert in the field, and we (at Red Hat) talk to them routinely.
If a CPU built in 1985 with a grand total of 26 000 transistors could afford it, I am pretty sure that anything built in this century could afford it too.
You'd be excluding many small CPUs which exist within other chips running very specialized code.
As profiles mandate these instructions anyway, there's no good reason to complicate the most basic RISC-V possible.
RISC-V is the ISA for everything, from the smallest such CPUs to supercomputers.
To the best of my knowledge (and Google-fu), 26K really isn't a lot of transistors for an embedded MCU - at least not a fully-featured 32-bit one comparable to a minimal RISC-V core. An ARM Cortex M0, which is pretty much the smallest thing out there, is around 10K gates => around 40K transistors. This is also around the same size as a minimal RISC-V core AFAICT.
The ARM core has a shifter, though.
The sequence of instructions given above is incorrect, it does not detect integer overflow (i.e. signed integer overflow). It detects carry, which is something else.
The correct sequence, which can be found in the official RISC-V documentation, requires more instructions.
Not checking for overflow in C programs is a serious mistake. All decent C compilers have compilation options for enabling checking for overflow. Such options should always be used, with the exception of the functions that have been analyzed carefully by the programmer and the conclusion has been that integer overflow cannot happen.
For example with operations involving counters or indices, overflow cannot normally happen, so in such places overflow checking may be disabled.
> Maybe looking at libgmp could be interesting, though I suspect absolute numbers will not be meaningful, and there's no baseline to compare them to.
Realistically, nobody cares about BigInt addition performance, considering there is no GMP implementarion using SIMD, or even any using dependency breaking to get beyond 64-bit per cycle.
I whipped up a quick AVX-512 implementation that was 2x faster than libgmp on Zen4 (which has 256-bit SIMD ALUs). On RISC-V you'd just use RVV to do BigInt stuff.
The SpacemiT K3 (https://www.spacemit.com/products/keystone/k3 https://www.cnx-software.com/2026/01/23/spacemit-k3-16-core-...) is the one everyone is waiting for. We have one in house (as usual, cannot discuss benchmarks, but it's good). Unfortunately I don't think there is anyone reputable offering pre-orders yet.
add t0, t1, t2
addw t3, t1, t2
bne t0, t3, overflow add eax, ecx
jo overflowCode size is a benefit for x86-64 however - no one is arguing that - but you have to trade that against the difficulty of instruction decoding.
Except it isn't. Code isn't one single pattern repeating again and again; on large enough bodies of code, RISC-V is the most dense, and it's not even close.
Every CPU changes the cycles it takes for many instructions, adds new instructions etc.
Out of order execution is a huge dividing line in performance for a reason. The CPU itself needs to figure these things out to minimize memory latency, cache latency, pipelining, prefetching and all that stuff.
Maybe compilers would get better, maybe Itanium would have needed some redesign, after all it isn't as if a Raptor Lake Refresh execution units are the same as an Xeon Nocona, yet both execute x64 instructions.
First, the code claims to be returning "unsigned long" from each of these functions, but the value will only ever be 0 or 1 (see [1]). The code is actually throwing away the result and just returning whether overflow occurred. If we take unsigned long *c as another argument to the function, so that we actually keep the result, we end up having to issue an extra instruction for multiplication (see [2]; I'm ignoring the sd instruction since it is simply there to dereference the *c pointer and wouldn't exist if the function got inlined).
Second, this is just unsigned overflow detection. If we do signed overflow detection, now we're up to 5 instructions for add and mul (see [3]). Considering that this is the bigger challenge, it compares quite unfavorably to architectures where this is just 2 instructions: the operation itself and a branch against a condition flag.
[1]: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
There are many chips in the market that do embed 8051s for janitorial tasks, because it is small and not legally encumbered. Some chips have several non-exposed tiny embedded CPUs within.
RISC-V is replacing many of these, bringing modern tooling. There's even open source designs like SERV that fit in a corner of an already small FPGA, leaving room for other purposes.
(Although I do have to eat my words here - I didn't check that Wikipedia page, and it does actually list a ~6K RISC-V core! It's an experimental academic prototype "made from a two-dimensional material [...] crafted from molybdenum disulfide"; I don't know if that construction might allow for a more efficient transistor count and it's totally impractical - 1KHz clock speed, 1-bit ALU, etc. - for almost any purpose, but it is technically a RISC-V implementation significantly smaller than 26K)
That sounds like a microcoded RISC-V implementation, which can really be done for any ISA at the extreme expense of speed.
This is actually kind of counter to your point. The really tiny micro-controllers from the 80s only had 224 bits of registers. RV32E is at least twice that (16 registers*32 bits), and modern mcus generally use 2-4kbs of sram, so the overhead of a 32 bit barrel shifter is pretty minimal.
You're super lucky to have your hands on one!
Historically RISC instructions were 1:1 with CPU operations, in theory allowing the compiler to better optimise logic, but this isn't really true anymore. High performance ARM CPUs use µOPs and macro-op fusion, though not to the extent of x86 CPUs.
This document from ARM has some details on how they use micro-ops, https://developer.arm.com/documentation/102160/latest
Maybe other CPU's have it as well, though I do not have enough information on that.
The RISC-V ecosystem being handicapped by backwards compatibility does not make sense at this point.
Every new RISC-V board is going to be RVA23 capable. Now is the time to draw a line in the sand.
Nobody really forces you to use x64 if you don't like it, just as nobody forced you to use Itanium — which Intel famously failed to "shove down the customers' throats" btw.
I only use it for microcontrollers and it's really nice there. But yeah I can imagine it doesn't perform well on bigger stuff. The idea of risc was to put the intelligence in the compiler though, not the silicon.
Depends on what the instruction does. If it goes through a four-loads-four-stores chain that VAXen could famously do (with pre- and post-increments), then sure, this makes it impossible to implements such ISA in a multiscalar, OOO manner (DEC tried really, really hard and couldn't do it). But anything that essentially bit-fiddles in funny ways with the 2 sets of 64 bits already available from the source registers, plus the immediate? Shove it in, why not? ARM has bit shifted immediates available for almost every instruction since ARMv1. And RISC-V also finally gets shNadd instructions which are essentially x86/x64's SIB byte, except available as a separate instruction. It got "andn" which, arguably, is more useful than pure NOT anyway (most uses of ~ in C are in expressions of "var &= ~expr..." variety) and costs almost nothing to implement. Bit rotations, too, including rev8 and brev8. Heck, we even got max/min instructions in RISC-V because again, why not? The usage is incredibly widespread, the implementation is trivial, and makes life easier both for HW implementers (no need to try to macrofuse common instruction sequences) and the SW writers (no need to neither invents those instruction sequences and hope they'll get accelerated nor read manufacturers datasheets for "officially" blessed instruction sequences).
Itanium did this mistake. Sure, compilers are much better now, but still dynamic scheduling beats static one for real-world tasks. You can (almost perfectly) statically schedule matrix multiplication but not UI or 3D game.
Even GPUs have some amount of dynamic scheduling now.
Less than 7 years from ratification of the initial RV{32,64}GC spec.
Less than 5 years from the first mass-produced roughly original Raspberry Pi level $100 SBC: AWOL Nezha, shipped June 2021.
Nope. See https://github.com/llvm/llvm-project/issues/110454 which was linked in the first issue. The spec authors have managed to made a mess even here.
Now they want to introduce yet another (sic!) extension Oilsm... It maaaaaay become part of RVA30, so in the best case scenario it will be decades before we will be able to rely on it widely (especially considering that RVA23 is likely to become heavily entrenched as "the default").
IMO the spec authors should've mandated that the base load/store instructions work only with aligned pointers and introduced misaligned instructions in a separate early extension. (After all, passing a misaligned pointer where your code does not expect it is a correctness issue.) But I would've been fine as well if they mandated that misaligned pointers should be always accepted. Instead we have to deal the terrible middle ground.
>atomic memory operations are made mandatory in Ziccamoa
In other words, forget about potential performance advantages of load-link/store-conditional instructions. `compare_exchange` and `compare_exchange_weak` will always compile into the same instructions.
And I guess you are fine with the page size part. I know there are huge-page-like proposals, but they do not resolve the fundamental issue.
I have other minor performance-related nits such `seed` CSR being allowed to produce poor quality entropy which means that we have bring a whole CSPRNG if we want to generate a cryptographic key or nonce on a low-powered micro-controller.
By no means I consider myself a RISC-V expert, if anything my familiarity with the ISA as a systems language programmer is quite shallow, but the number of accumulated disappointments even from such shallow familiarity has cooled my enthusiasm for RISC-V quite significantly.
If you'd start from a clean sheet today you'd probably end up with a somewhat bigger base page size. Not hugely larger though, as that wastes a lot of memory for most applications. Maybe 16k like some ARM chips use?
X86-64 also has “profiles” which tell you what extensions should be available. There is x86-64v1 and x86-64v4 with v2 and v3 in the middle.
RVA23 offers a very similar feature-set to x86-64v4.
You do not end up with a mess of extensions. You get RVA23. Yes, RVA23 represents a set of mandatory extensions. The important thing is that two RVA23 compliant chips will implement the same ones.
But the most important point is that you cannot “just use x86-64”. Only Intel and AMD can do that. Anybody can build a RISC-V chip. You do not need permission.
No, anybody can’t build a RISC-V chip. That’s the same mistake OSS proponents make. Just because something is open source doesn’t mean bugs will be found. And just because bugs are found doesn’t mean they will be fixed. The vast majority of people can’t do either.
The number of people who can design a chip implementation of the RISC-V ISA is much, much smaller, and the number who can get or own a FAB to manufacture the chips smaller still. You don’t need permission to use the ISA, but that is not the only gate.
2. Also, fundamentally all modern CPUs are still 64-bit version of 80386. MMU, protection, low level details are all same.
Uh, because you can't? It's not open in any meaningful sense.
The ISA is open so there's no greedy corporation trying to upsell you. I mean there's an implementation and die area cost for each extension but it's not being set at an artificial level by a monopolist.
RyanAir is about exploiting consumers, with bait-and-switch and shitty terms and conditions.
RISC-V's modularity is about giving choice to hardware designers, so they can pick and choose just those features that their solution needs, and even allow for custom extensions.
RISC-V's modularity is for academia. 1) for education, where students learn/use/work on simple processors, 2) for research in new types of hardware and extensions, where ease of implementation or ease of creating a custom extension is important.
And where it actually mattered they did not introduce a separate extension. Integer division is significantly more complex than multiplication, so it may make sense for low-end microcontrollers to implement in hardware only the latter.
I would be ok with that if it was a valid analogy.
It is valid in microcontroller land. There, the chip and the software are provided by the same party. So you can select for exactly the RISC-V features you need and save yourself some silicon. That sounds like a win to me.
At the application level, like a server or a desktop, that would be a disaster because I get my hardware and software from different people. How do the software guys know what hardware to target? Well, that is exacly why RVA23 exists.
What does RVA23 mean? It is the RISC-V "Application" profile. It allows you to build software to a single hardware target and trust that hardware makers will target the same proifle. RVA23 is like saying x86-64v4. Both are simple names for a long list of extensions (flags) and assumptions that you expect the hardware to honour. So, when Ubuntu 26.04 says it requires RVA23, it means that all the software built on it can assume those features. No a la carte.
The reason RVA23 is geting so much attention is that it has essentially the same feature set as modern ARM64 or x86-64. Software will be able to target this profile for a long time. There may be a new profile in a few years time, like RVA30, but hardware that implements that will still run RVA23 software (just as x86-64v4 hardware will run x86-64v1 software). Hardware built for profiles before RVA23 may be missing features modern applications expect.
I guess you could say that RVA23 is British Airways Business Class.
If you really want to support hardware designed before RVA23, almost everything you would want to run pre-built software on supports RVA20. And again, your RVA20 stuff will run fine on RVA23 hardware (but with fewer features--like no vectors). So maybe no in-flight meal, but it will get you there.
As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if it's seed entropy is sufficient. By the time you get to CPUs where portable code is important a CSPRNG is probably fine.
I agree about page size though. Svnapot seems overly complicated and gives only a fraction of the advantages of actually bigger pages.
It's a terrible attitude to have towards programmers, but looking at misaligned ops, I guess we can see a pattern from RISC-V authors here.
Most programmers do not target a concrete microcontroller and develop every line of code from scratch. They either develop portable libraries (e.g. https://docs.rs/getrandom) or build their projects using those libraries.
The whole raison d'être of an ISA is to provide a portable contract between hardware vendors and programmers . RISC-V authors shirk this responsibility with "just look at your micro specs, lol" attitude.
aka, Zicclsm / RVA23 are entirely-useless as far as actually getting to make use of native misaligned loads/stores goes.
It's just a bizarre design choice. I understand wanting to get rid of condition flags, but not replacing them with nothing at all.
EDIT: It seems the same choice was made by MIPS, which is a clear inspiration for RISC-V.
1. 64 bit signed math is a lot less overflow vulnerable than the 16/32 bit math that was extremely common 20 years ago
2. For the BigInt use-case, the Riscv design is pretty sensible since you want the top bits, not just presence of overflow
3. You can do integer operations on the FPU (using the inexact flag for detecting if rounding occurred).
4. Adding overflow detecting instructions can easily be done in an extension in the future if desired.
> The number of people who can design a chip implementation
Thankfully you don't have to start from scratch. There are loads of open source RISC-V chip implementations you can start from.
> get or own a FAB to manufacture the chips
There is always FPGAs and also this:
Yes, they can. My point is that nobody needs to give you permission. You can pretend that does not matter but China is about to educate us about what this means rather dramatically in the next few years.
And India is building RISC-V chips. And Europe is building RISC-V chips. Tenstorrent started in Canada (building RISC-V chips).
> the number who can get or own a FAB to manufacture the chips
Really? Almost nobody owns fabs and yet there are a multitude of chip makers. Getting access to a fab requires only money. It has nothing to do with the ISA or your skills. TSMC can make RISC-V chips just fine and already do. In some places, like China, RISC-V chips may be at the front of the line.
> The number of people who can design a chip implementation of the RISC-V ISA
Anybody can build a RISC-V chip. Build one yourself: https://github.com/tscheipel/HaDes-V
Every electrical engineer is going to know how to design a RISC-V chip. But you could also be an intelligent garbage man and design a RISC-V chip in your spare time using only open source materials. You can even tape it out.
"But that is only a 32 bit microcontroller!", you might say. Sure. But the skills to build RISC-V are going to propogate. Of course, that does not mean that everybody in the world is going to figure out how to build chips. That is clearly not my point. They will still be built primarily by a select few. But that is not unique to RISC-V by any stretch. In fact, less so.
The hard part about building a chip from scratch is not the ISA. You think that a world-class engineer working with ARM64 or amd64 today cannot design a RISC-V chip? That is like saying a carpenter building oak cabinets lacks the skills to make them with maple.
And since it is the same amount of work to start fresh regardless of ISA, why not start with RISC-V?
Except you do not have to start fresh with RISC-V because there are many, and will be many, many more, open designs to study and start with. Here is a 64 bit chip that implements the very latest RISC-V vector extensions:
https://github.com/tenstorrent/riscv-ocelot
Which, by the way, means that although most won't, anybody can build a RISC-V chip.
The RISC-V world will look like ARM. Most chip makers will license the core design off somebody else. But there will be more of those "somebody elses" to choose from. And there will be more people who choose to design their own silicon. Meta just bought Rivos. What for do you think? And they did not have to talk to ARM about it.
In some of these, less silicon means less power means more better. Like that last example.
About 1/3 of the opcode space is used currently so there's a decent amount of space left.
See, for example, https://www.pingcap.com/blog/transparent-huge-pages-why-we-d...
I know for example that Berkley when thinking pre-RISC-V that they had a deal with Intel about using x86-64 for research. But they were not able to share the designs.
CISC ISAs were historically designed for humans writing assembly so they have single instructions with complex behaviour and consequently very high instruction density.
RISC was designed to eliminate the complex decoding logic and replace it with compiler logic, using higher throughput from the much reduced decoding logic (or in some cases no decoding at all) to offset the increased number of instructions. Also the transistors that were used for decoding could be used for additional ALUs to increase parallelism.
So RISC by its nature is more verbose.
Does the tradeoff still make sense? Depends who you ask.
Currently, RISC-V holds the crown of code density in both 64 and 32 bit.
On 32bit, thumb2 is a little behind. On 64bit, x86-64 is not even close, and ARMv8/v9 are even worse.
"Maybe if I keep repeating it, it'll be true."
If you're using OSS it doesn't really matter as you can compile it for whatever you want.
Almost all software I encountered - including Windows 10 and precompiled Debian 13 - needs only SSE4.2, essentially mid-2000s ISA. Intel produced until very recently (early 2020s) Celeron CPUs which did not even support AVX.
Right but it doesn't guarantee that anything is unreasonably slow does it? I am free to make an RVA23 compliant CPU with a div instruction that takes 10k cycles. Does that mean LLVM won't output div? At some point you're left with either -mcpu=<specific cpu> and falling back to reasonable assumptions about the actual hardware landscape.
Do ARM or x86 make any guarantees about the performance of misaligned loads/stores? I couldn't find anything.
However, the spec has the explicit note:
> Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.
Which was a mistake. As you said any instruction could be arbitrarily slow, and in other aspects where performance recommendations could actually be useful RVI usually says "we can't mandate implementation".
Indeed one can make any instruction take basically-forever, but I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation (on at least some inputs), else having said instruction is strictly-redundant.
And if any significant general-purpose hardware actually did a 10k-cycle div around the time the respective compiler defaults were decided, I think there's a good chance that software would have defaulted to calling division through a function such that an implementation can be picked depending on the running hardware. (let's ignore whether 10k-cycle-division and general-purpose-hardware would ever go together... but misaligned-mem-ops+general-purpose-hardware definitely do)
How is that different for RISC-V?
> I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation
I agree! So just use misaligned loads if Zicclsm is supported. As you observed there's a feedback loop between what compilers output and what gets optimised in hardware. Since RVA23 hardware is basically non-existent at the moment you kind of have the opportunity to dictate to hardware "LLVM will use misaligned accesses on RVA23; if you make an RVA23 chip where this is horribly slow then people will laugh at you and assume it's some sort of silicon defect".
In what way are RISC-V profiles debatable? Canonical is spearheading the RVA23-as-a-default movement and so far, it seems that there are no heavy objections towards that effort (beyond the usual "Canonical sucks" shtick that you see in every discussion involving Canonical)
This isn't easy but it can be done (and it is being done on x86, despite constantly evolving variations of AVX).
So, you can compile your RISC-V software to require the equivalent of AVX and it will run on whatever size vectors the hardwre supports.
So, on x86-64, if I write AVX2 software and run it on AVX512 capable hardware, I am leaving performance on the table. But if I write software that uses AVX512, it will not run on hardware that does not support those extensions (flags).
On RISC-V, the same binary that uses 256 bit vectors on hardware that only supports that will use 512 bit vectors on hardware that supports it, or even 1024 bit vectors on hardware like the A100 cores of the SpacemiT K3.
So, I guess X86-64 is is the RyanAir of processors.
Are you sure, especially considering China?
I doubt there is any legal barrier, because there are a few existing projects with x86 cores on an FPGA, as well as some SoCs. Here's a 486: https://opencores.org/projects/ao486
The "G" extension for everything you want to run shrink-wrapped binaries on a standard OS has been there since the May 7 2014 "User Level ISA, Version 2.0", which is before RISC-V started to be promoted outside of Berkeley e.g. at Hot Chips 26 in August 2014, and the first RISC-V workshop in January 2015 in Monterey.
The name "G" has morphed into now (along with the C extension) being called "RVA20", which led to "RVA22" and "RVA23", but the principle is unchanged.
"An integer base plus these four standard extensions (“IMAFD”) is given the abbreviation “G” and provides a general-purpose scalar instruction set. RV32G and RV64G are currently the default target of our compiler toolchains."
pp 4-5 in
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-...
As for general purpose processors, RISC-V has always had the idea of profiles (mandatory set of extensions). Just look at the G extension, which mandated floating point, multiply/division, atomics, ... things that you expect to see on user-facing general-purpose processors.
> the belated admission that maybe we shouldn't have everything as optional extras
That's why I disagree with the above claim.
(1) The optionality is a feature of RISC-V and it allows RISC-V to shine on different ecosystems. The desktop isn't everything.
(2) RISC-V has always addressed the fear of fragmentation on the desktop by using profiles.
The "G" extension for everything you want to run shrink-wrapped binaries on a standard OS has been there since the May 7 2014 "User Level ISA, Version 2.0", which is before RISC-V started to be promoted outside of Berkeley e.g. at Hot Chips 26 in August 2014, and the first RISC-V workshop in January 2015 in Monterey.
The name "G" has morphed into now (along with the C extension) being called "RVA20", which led to "RVA22" and "RVA23", but the principle is unchanged.
"An integer base plus these four standard extensions (“IMAFD”) is given the abbreviation “G” and provides a general-purpose scalar instruction set. RV32G and RV64G are currently the default target of our compiler toolchains."
pp 4-5 in
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-...
As for opencores, yes you can design them, but do any companies making commercial products sell them?
This works quite well in practice. As to leaving performance on the table, it seems RVV has some egregious performance differences/cliffs. For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?
Use Zvzip, in the mean time:
zip: vwmaccu.vx(vwaddu.vv(a, b), -1, b), or segmented load/store when you are touching memory anyways
unzip: vsnrl
trn1/trn2: masked vslide1up/vslide1down with even/odd mask
The only thing base RVV does bad in those is register to register zip, which takes twice as many instructions as other ISAs. Zvzip gives you dedicated instructions of the above.
Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?
[1]: https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3... [2]: https://github.com/google/highway/blob/master/g3doc/quick_re...
RISC-V hardware with slow misaligned mem ops does exist to non-insignificant extent, and it seems not enough people have laughed at them, and instead compilers did just surrender and default to not using them.
> As you observed there's a feedback loop between what compilers output and what gets optimised in hardware.
Well, that loop needs to start somewhere, and it has already started, and started wrong. I suppose we'll see what happens with real RVA23 hardware; at the very least, even if it takes a decade for most hardware to support misaligned well, software could retroactively change its defaults while still remaining technically-RVA23-compatible, so I suppose that's good.
Only U74 and P550, old RV64GC CPUs.
SiFive's RVA23 cores have fast misaligned accesses, as do all THead and SpacemiT cores.
I can't imagine that all the Tenstorrent and Ventana and so forth people doing massively OoO 8-wide cores won't also have fast misaligned accesses.
As a previous poster said: if you're targeting RVA23 then just assume misaligned is fast and if someone one day makes one that isn't then sucks to be them.
Also Kendryte K230 / C908, but only on vector mem ops, which adds a whole another mess onto this.
I'd hope all the massive OoO will have fast misaligned mem ops, anything else would immediately cause infinite pain for decades.
But of course there'll be plenty of RVA23 hardware that's much smaller eventually too, once it becomes a general expectation instead of "cool thing for the very-top-end to have".
I do agree that it'd be reasonable to just assume fast misaligned ops, but for whatever reason gcc and clang just don't, and that's what we have for defaults.
LLVM and GCC developers clearly disagree with you. In other words, re-iterating the previously raised point: Zicclsm is effectively useless and we have to wait decades for hypothetical Oilsm.
Most programmers will not know that the misaligned issue even exists, even less about options like -mno-strict-align. They just will compile their project with default settings and blame RISC-V for being slow.
RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.
> RISC-V could've easily avoided all this mess by properly mandating misaligned pointer handling as part of the I extension.
Rather hard to mandate performance by an open ISA. Especially considering that there could actually be scenarios where it may be necessary to chicken-bit it off; and of course the fact that there's already some questionability on ops crossing pages, where even ARM/x86 are very slow.
I would be fine with any of the following 3 approaches:
1) Mandate that store/loads do not support misaligned pointers and introduce separate misaligned instructions (good for correctness, so its my personal preference).
2) Mandate that store/loads always support misaligned pointers.
3) Mandate that store/loads do not support misaligned pointers unless Zicclsm/Oilsm/whatever is available.
If hardware wants to implement a slow handling of misaligned pointers for some reason, it's squarely responsibility of the hardware's vendor. And everyone would know whom to blame for poor performance on some workloads.
We are effectively going to end up with 3, but many years later and with a lot of additional unnecessary mess associated with it. Arguably, this issue should've been long sorted out in the age of ratification of the I extension.
Indeed extremely sad that Zicclsm wasn't a thing in the spec, from the very start (never mind that even now it only lives in the profiles spec); going through the git history, seems that the text around misaligned handling optionality goes all the way back to the very start of the riscv/riscv-isa-manual repo, before `Z*` extensions existed at all.
More broadly, it's rather sad that there aren't similar extensions for other forms of optional behavior (thing that was recently brought up is RVV vsetvli with e.g. `e64,mf2`, useful for massive-VLEN>DLEN hardware).
I wouldn't call it "waste". Moreover, it's fine for misaligned instructions to use a wider encoding or be less rich than their aligned counterparts. For example, they may not have the immediate offset or have a shorter one. One fun potential possibility is to encode the misaligned variant into aligned instructions using the immediate offset with all bits set to one, as a side effect it also would make the offset fully symmetric.
In terms of correctness, there's also the possibility of partially-misaligned ops (e.g. an 8B load with 4B alignment, loading two adjacent int32_t fields) so you're not handling everything with correct faults anyways.
No, it was released to customers in June 2021, almost five years ago.
https://www.sifive.com/press/sifive-performance-p550-core-se...
It has take a while for this core to appear in an SoC suitable for SBCs, as Intel was originally announced as doing that and got as far as showing a working SoC/Board at the Intel Innovation 2022 event in September 2022.
Someone who attended that event was able to download the source code for my primes benchmark and compile and run it, at the show, and was kind enough to send me the results. They were fine.
For reasons known only to Intel, they subsequently cancelled mass production of the chip.
ESWIN stepped up and made the EIC7700X, as used in the Milk-V Megrez and SiFive HiFive Premier P550, which did indeed ship just over a year ago.
But technically we could have had boards with the Intel chip three years ago.
Heck we should have had the far better/faster Milk-V Oasis with the P670 core (and 16 of them!) two years ago. Again, that was business/politics that prevented it, not technology.
Ah, okay. (still, like, at least a couple decades newer than the last x86-64 chip with slow unaligned mem ops, if such ever existed at all? Haven't heard of / can't find anything saying any aarch64 ever had problems with them either, so still much worse for the RISC-V side).
Well, I suppose we can hope that business/politics messes will all never happen again and won't affect anything RVA23.
This very much has a "for now" on it. Once there is actually widespread hardware with the feature, I would be very surprised if the compilers don't update their heuristics (at least for RVA23 chips)