[0] https://doc.rust-lang.org/stable/std/macro.is_x86_feature_de...
But yeah no, on the whole the cost of the checks and the duplicated binary size isn't seen as worth it, so instead you get piecemeal implementations, mostly in numeric packages like Eigen and LAPACK.
Because that’s where the user-noticeable gains can be made. Using popcount in code you run once is going to shave off, maybe, 100 cycles. That isn’t worth the extra cycles of that approach.
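For anyone who hasn't seen what the per-call check looks like, here is a minimal Rust sketch of runtime dispatch for popcount, assuming the `is_x86_feature_detected!` macro from std. The function names (`popcount`, `popcount_hw`, `popcount_portable`) are made up for illustration and don't come from any library. Note the branch sits in front of every call, which is the overhead being discussed.

```rust
/// Portable fallback: clears the lowest set bit until nothing is left.
fn popcount_portable(mut x: u64) -> u32 {
    let mut n = 0;
    while x != 0 {
        x &= x - 1; // drop the lowest set bit
        n += 1;
    }
    n
}

/// Compiled with POPCNT enabled for this one function only, so the
/// compiler is allowed to lower `count_ones` to the hardware instruction.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "popcnt")]
unsafe fn popcount_hw(x: u64) -> u32 {
    x.count_ones()
}

/// Hypothetical dispatcher: checks the CPU at run time, then picks a path.
#[cfg(target_arch = "x86_64")]
pub fn popcount(x: u64) -> u32 {
    if is_x86_feature_detected!("popcnt") {
        // SAFETY: the feature was detected on the running CPU just above.
        unsafe { popcount_hw(x) }
    } else {
        popcount_portable(x)
    }
}

/// On other architectures there is nothing to detect in this sketch.
#[cfg(not(target_arch = "x86_64"))]
pub fn popcount(x: u64) -> u32 {
    popcount_portable(x)
}

fn main() {
    println!("{}", popcount(0xF0F0)); // prints 8
}
```

For a single popcount the check likely costs about as much as it saves; dispatch tends to pay off only when it is hoisted out of a hot loop.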
Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.
Finally, where it really matters, it’s not only a matter of recompiling the same code. For optimal performance, you also want to change loop unrolling strategy, stride count, etc.
This seems like a strange thing to say. Fine-grained feature detection was around long before "microarchitecture levels" and never went away. The microarchitecture levels were introduced because they were easier to use.
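To make the relationship concrete: a level is just a named bundle of individual feature bits, so a level check can be written on top of fine-grained detection. A rough Rust sketch, assuming std's `is_x86_feature_detected!`; the helper names are hypothetical, and it deliberately checks only a subset of what x86-64-v3 requires (the full list also includes F16C, MOVBE and OSXSAVE), so treat it as an approximation.

```rust
/// Hypothetical helper: names of the checked v3 features that the
/// running CPU does not report. Only a subset of x86-64-v3 is checked.
#[cfg(target_arch = "x86_64")]
pub fn missing_v3_features() -> Vec<&'static str> {
    let checks = [
        ("avx", is_x86_feature_detected!("avx")),
        ("avx2", is_x86_feature_detected!("avx2")),
        ("bmi1", is_x86_feature_detected!("bmi1")),
        ("bmi2", is_x86_feature_detected!("bmi2")),
        ("fma", is_x86_feature_detected!("fma")),
        ("lzcnt", is_x86_feature_detected!("lzcnt")),
    ];
    let mut missing = Vec::new();
    for (name, present) in checks.iter() {
        if !*present {
            missing.push(*name);
        }
    }
    missing
}

/// On non-x86 targets every checked feature counts as missing.
#[cfg(not(target_arch = "x86_64"))]
pub fn missing_v3_features() -> Vec<&'static str> {
    vec!["avx", "avx2", "bmi1", "bmi2", "fma", "lzcnt"]
}

/// The "level" view: one yes/no answer derived from the individual bits.
pub fn has_v3_subset() -> bool {
    missing_v3_features().is_empty()
}

fn main() {
    if has_v3_subset() {
        println!("checked v3 features are all present");
    } else {
        println!("missing: {:?}", missing_v3_features());
    }
}
```

The level answer is easier to act on, but the per-feature list is what tells you why a 2012 CPU falls short.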
It's not entirely free; the cost is that the resulting binary will no longer run on processors that lack the instruction. Which, admittedly, is ≈2007 or older. But still! I have a 2012 CPU still in service, and as much as I'd love to obsolete it, gestures at the price tag of RAM these days.
… a 2012 CPU is surprisingly competitive relative to today's tech, too, I'd add. The gap between 2012 and 2026 is nothing compared to the equivalent gap between 1998 and 2012: 1998 is like 500MHz single-core, 32-bit. 2012 is 4 core, 8 hyper threads, 64-bit, 3.5 GHz. (… perhaps more remarkably, my next-oldest machine, a 2017 laptop, is only 2.8 GHz, with the same 4(/8) cores. It also uses like half the power, too. That's mostly the "laptop" bit, though.)
(That same CPU is also incapable of "v3".)
I suspect that heavily optimised code either uses intrinsics or carefully written assembler code.
Ubuntu started allowing users to default to v3 packages, and I opted in. I already use the -C native to enable AVX512 when compiling binaries for local use. This matters a lot for compute/analytics workloads in my experience.
Speaking of Dr Lemire's suggestion of a V5 architecture level, would that make any sense given the fragmentation of AVX512? There is none on Intel consumer devices, but it is present on the last few generations of AMD.
I wonder if this is a natural law, or emergent behavior of complex systems?
https://go.dev/wiki/MinimumRequirements#:~:text=The%20Go%20t...
Edit: to address your literal remark: so even the title is correct, if you think of a programming language as more than its syntax.
Go's selling point is definitely not performance.
For many other things, like using a YMM register to copy a 32-byte struct or a variable shift, run-time dispatch just does not make sense. You will only see a benefit if you generate this code unconditionally. For FMA, you wouldn't even get bit-identical output, leading to testing concerns.
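The FMA point is easy to reproduce. A small Rust sketch, with inputs picked by hand so that the single rounding of a fused multiply-add matters; `unfused` and `fused` are illustrative names, and `f64::mul_add` is the std method that performs the fused operation.

```rust
/// Multiply, round to f64, then add and round again (two roundings).
fn unfused(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// Fused multiply-add: the product is not rounded before the addition.
fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

fn main() {
    let eps = 1.0 / (1u64 << 27) as f64; // 2^-27, exactly representable
    let a = 1.0 + eps;
    let c = -(1.0 + 2.0 * eps);

    // Exact value of a*a is 1 + 2^-26 + 2^-54. The last term is below
    // half an ulp at this magnitude, so the unfused product rounds it away.
    println!("unfused: {:e}", unfused(a, a, c));
    println!("fused:   {:e}", fused(a, a, c));
    println!("bit-identical: {}", unfused(a, a, c).to_bits() == fused(a, a, c).to_bits());
}
```

With these inputs the unfused result is exactly 0 while the fused result is 2^-54, so a test suite that compares outputs bit for bit would see different answers depending on which path was dispatched.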
the thread is about runtime detection tbf
[0] https://www.phoronix.com/review/clear-linux-48p-ubuntu/6