Linus Torvalds on AVX512(phoronix.com) |
“One challenge with AVX-512 is that it can actually _slow down_ your code. It's so power hungry that if you're using it on more than one core it almost immediately incurs significant throttling. Now, if everything you're doing is 512 bits at a time, you're still winning. But if you're interleaving scalar and vector arithmetic, the drop in clock speeds could slow down the scalar code quite substantially.” - 3JPLW and https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
The processor does not immediately downclock when encountering heavy AVX512 instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Light 512-bit instructions will move the core to a slightly lower clock.
* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).
* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).
As per https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...
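To make the "still winning vs. slower overall" trade-off concrete, here is a toy model in C. It is only a sketch: the function name and every input value are hypothetical, and it pessimistically assumes the whole run executes at the reduced clock, whereas the notes above say the downclock is per core and lasts only a short time after the heavy instructions.

```c
/* Toy model: net speedup when a fraction of the work is vectorized
 * but the core runs at a reduced clock for the whole run.
 * All parameters are hypothetical inputs, not measured values. */

/* vec_fraction: share of the original scalar runtime that gets vectorized (0..1)
 * vec_speedup:  speedup of that share at an unchanged clock, e.g. 8.0
 * clock_ratio:  throttled clock divided by nominal clock, e.g. 0.75 */
double net_speedup(double vec_fraction, double vec_speedup, double clock_ratio)
{
    /* Runtime at nominal clock, normalized so the scalar-only run takes 1.0. */
    double time_at_nominal = (1.0 - vec_fraction) + vec_fraction / vec_speedup;

    /* A lower clock stretches every remaining cycle, scalar and vector alike. */
    double time_throttled = time_at_nominal / clock_ratio;

    return 1.0 / time_throttled;
}
```

With these made-up numbers, net_speedup(1.0, 8.0, 0.75) gives 6.0 (everything is 512-bit, so you still win), while net_speedup(0.1, 8.0, 0.75) gives roughly 0.82 (mostly scalar code, so the clock drop costs more than the vector part saves).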
Can other SIMD instructions (AVX2, say) do the same?
On Intel CPUs, yes. There's even a BIOS/UEFI setting, called "AVX offset", to specify how much you want the clock frequency to drop when running AVX code. AMD CPUs don't do that though, as far as I know.
The thermal hit of using wider vectors decreases with every node shrink though, so expect the issue to become muted over time (which also explains why that doesn't apply to AMD - their only µarch with 256-bit execution units, Zen 2, is on a better node than Intel).
I'm sure Intel should fix the problems Linus is complaining about, but I feel like chip vendors are being forced into this "add special purpose blocks" approach, as the only way to make their new chips better than their old ones.
Dealing with these issues might require you to know the corners of the instruction set really well. Sometimes the solution lies outside the instruction set and depends on how your data structure is laid out in memory, which leads you to AoS vs SoA analysis, etc.
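For anyone who hasn't run into AoS vs SoA, a minimal C sketch of the two layouts follows. The particle fields, the capacity, and the function names are all made up for illustration.

```c
#include <stddef.h>

#define N_PARTICLES 1024   /* hypothetical fixed capacity */

/* Array of Structures: the fields of one particle are adjacent, so
 * consecutive x values sit 16 bytes apart in memory. */
struct particle_aos { float x, y, z, mass; };

/* Structure of Arrays: all x values are contiguous, so one vector load
 * fetches x for several consecutive particles. */
struct particles_soa {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
    float mass[N_PARTICLES];
};

/* Strided access: a vectorizer needs gathers or shuffles to fill a register. */
void shift_x_aos(struct particle_aos *p, size_t n, float dx)
{
    for (size_t i = 0; i < n; i++)
        p[i].x += dx;
}

/* Unit-stride access: maps directly onto packed loads and stores. */
void shift_x_soa(struct particles_soa *p, size_t n, float dx)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] += dx;
}
```

Both functions compute the same thing; only the memory access pattern differs, and that is usually what decides whether the loop vectorizes well.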
Compilers and vectorization: Based on reading a lot of assembly output, I think what compilers usually struggle with are assumptions that the human programmer knows hold for a given piece of code but that the compiler has no right to make. Some of this is basic alignment; gcc and clang have intrinsics for these. Sometimes it's related to the memory model of the programming language disallowing a load or a store at specific points.
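As a rough illustration of handing those assumptions to the compiler, here is a C sketch using restrict and the GCC/Clang builtin __builtin_assume_aligned. The function name and the 32-byte alignment are assumptions chosen for the example.

```c
#include <stddef.h>

/* y[i] += a * x[i], with two promises the compiler cannot infer by itself:
 *  - restrict: x and y never overlap, so stores to y cannot change x
 *  - __builtin_assume_aligned (GCC/Clang builtin): both pointers are
 *    32-byte aligned, so aligned vector loads and stores are legal.
 * Breaking either promise is undefined behavior; the caller must uphold them. */
void saxpy_aligned(float *restrict y, const float *restrict x, float a, size_t n)
{
    float *ya = __builtin_assume_aligned(y, 32);
    const float *xa = __builtin_assume_aligned(x, 32);

    for (size_t i = 0; i < n; i++)
        ya[i] += a * xa[i];
}
```

The catch is the one described above: the compiler takes these promises on faith, so passing an unaligned or overlapping buffer silently becomes the programmer's bug.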
GPGPU programmability: GPUs being easy to program is something I take with a grain of salt. Yes, it's easy to get up and running with CUDA. Making an _efficient_ CUDA program, however, is easily as challenging as writing an efficient AVX program, if not more so.
https://pharr.org/matt/blog/2018/04/18/ispc-origins.html#aut...
> as long as vectorization can fail (and it will), […] you must come to deeply understand the auto-vectorizer. […] This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation
It's not all that different conceptually to AVX-512 with mask registers, except the vector size is even larger and of course the programming model differs.
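For readers unfamiliar with mask registers, the idea can be modeled in plain scalar C. This is only a sketch of the concept (a per-lane branch executed as a compare plus two masked writes); the function name and the 16-lane width are assumptions, and no real intrinsics are used.

```c
#include <stdint.h>

#define LANES 16  /* sixteen 32-bit lanes, as in one 512-bit register */

/* Scalar model of predicated execution of
 *     out[i] = (in[i] < 0) ? 0 : in[i];
 * A comparison produces one mask bit per lane, then both sides of the
 * branch run over all lanes under complementary masks. */
void clamp_negative_masked(int32_t out[LANES], const int32_t in[LANES])
{
    uint16_t mask = 0;

    /* Step 1: compare all lanes and record the result as a bit mask. */
    for (int i = 0; i < LANES; i++)
        if (in[i] < 0)
            mask |= (uint16_t)(1u << i);

    /* Step 2: "then" side, written only where the mask bit is set. */
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            out[i] = 0;

    /* Step 3: "else" side, written only where the mask bit is clear. */
    for (int i = 0; i < LANES; i++)
        if (!(mask & (1u << i)))
            out[i] = in[i];
}
```

On real hardware each of the three loops would be a single instruction over the whole register, and every lane pays for both sides of the branch.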
I have a simplistic explanation - maybe not what you're looking for but it is the best I can do...
At 12m23s in the video he says, "If you're working in a layer and the layers are well constructed (abstracted) you really can make a lot of progress. But if the top layer says, 'to make this really fast, go change the bottom layer', then it's going to get all tangled up."
That's what implementing an algorithm on a SIMD architecture feels like to me. I have to figure out a way of filling my SIMD width with data each clock cycle, while in contrast, the specification of the algorithm deals with data one piece at a time.
Take insertion sort as a (bad) example.
i ← 1
while i < length(A)
    j ← i
    while j > 0 and A[j-1] > A[j]
        swap A[j] and A[j-1]
        j ← j - 1
    end while
    i ← i + 1
end while
That algorithm cannot easily take advantage of SIMD. You have to change the algorithm to make it work with the architecture. We'd probably say the algorithm is the top level of the abstraction stack, and the SIMD architecture is a level near the bottom. So this problem is the opposite way around to how Jim phrased it, but the point is that we have NOT got clean abstraction - an implementation in one layer depends on the implementation in another.
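To show what "changing the algorithm" can look like, here is a C sketch of odd-even transposition sort, a different algorithm (not a vectorized insertion sort) whose inner loop consists of independent compare-exchanges. It is written as scalar code; the function names are made up, and no claim is made that it beats insertion sort in practice.

```c
#include <stddef.h>

/* Compare-exchange without a data-dependent swap: afterwards *a <= *b. */
static void compare_exchange(int *a, int *b)
{
    int lo = *a < *b ? *a : *b;
    int hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Odd-even transposition sort: n phases, each touching disjoint pairs.
 * Within one phase no pair depends on another, so the inner loop is the
 * part that could process several pairs at once with packed min/max. */
void odd_even_sort(int *v, size_t n)
{
    for (size_t phase = 0; phase < n; phase++)
        for (size_t i = phase % 2; i + 1 < n; i += 2)
            compare_exchange(&v[i], &v[i + 1]);
}
```

The price is that it always does all n phases, even on nearly sorted input where insertion sort finishes almost immediately. That is the tangle between layers: the algorithm was picked to suit the hardware, not the problem.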
Kids these days get 8 cores for a 100W TDP.
When I was a boy, 100W got you a single core. And you didn't get dynamic frequency scaling, so it'd be putting out that heat all the time.
(We also had to walk to school barefoot in the snow, uphill both ways)
Even if the frequency was fixed, dissipated heat did definitely vary together with the computing load.
What's the problem? My old school pentiums kept my dorm room nice and toasty. Could keep my window cracked in the winter for fresh air while gentoo compiled...
Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.
With how AVX512 is implemented, there isn't much point in a compiler auto optimizing general purpose code to use it, because even if there is a theoretical speedup, it may well be slower in practice.
No. I recently could really, really have used the packed saturated integer arithmetic and horizontal addition in AVX2 (but my old machine doesn't support it) and even better, the same but 512 bits wide on AVX512. It would only have been 6 or 7 instructions, if that, but it was inner loop, and mattered. Using compiler intrinsics would have been fine. I think you're looking at things too narrowly.
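For context, this is roughly what those two operations do, modeled in scalar C. It is a simplified sketch with made-up function names: real packed instructions have their own lane ordering and element widths, which this does not try to reproduce.

```c
#include <stdint.h>

#define LANES 16  /* sixteen 16-bit lanes, as in one 256-bit register */

/* Signed saturating add for one lane: clamps instead of wrapping around. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Lane-wise saturating add over a whole "register". */
void packed_sat_add16(int16_t out[LANES],
                      const int16_t a[LANES], const int16_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = sat_add16(a[i], b[i]);
}

/* Simplified horizontal add: sums adjacent lane pairs into wider results. */
void pairwise_add16(int32_t out[LANES / 2], const int16_t v[LANES])
{
    for (int i = 0; i < LANES / 2; i++)
        out[i] = (int32_t)v[2 * i] + (int32_t)v[2 * i + 1];
}
```

In scalar code each lane costs an add plus two compares and branches, which is why having this as a handful of packed instructions matters in an inner loop.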
Actually we already have openmp to cuda (http://www2.engr.arizona.edu/~ece569a/Readings/GPU_Papers/3....) so just making it more production-ready would be perfect.
I think you got this backwards - the lack of developers' interest is what leads to the mistaken impression that GPU compute is only good for multimedia and FP-crunching workloads. Even looking at the success of GPU compute in mining cryptocoins (only ASICs do better) ought to be enough to tell you that we could do a lot more with them if we cared to.
That is simply not true. You can run the 64 Core on EPYC 2 all at once at 3Ghz all with Air Cooling.
At every node they have reduced power consumption; that is also one reason you see continuous performance improvement.
I'm not claiming anything controversial. Power not having scaled as well as area recently is often referred to as the end of Dennard scaling:
https://en.wikipedia.org/wiki/Dennard_scaling#Breakdown_of_D...
> You can run the 64 Core on EPYC 2 all at once at 3Ghz all with Air Cooling.
That can be true despite the fact that power hasn't scaled as well as area.
> At every node they have reduced power consumption
Yep, just not as much as they improved area.
The "weak form" of Moore's Law--"Performance doubles every 12-18 months"--is dead and buried.
The "strong form" of Moore's Law is still active--"Transistor cost halves every 12-18 months".
This means that you can't make the primary paths any faster. So, all you can do is add functionality and pray that someone magically can make that functionality relevant to the primary use cases.
Crypto or video decoding comes to mind, those would be much faster with dedicated silicon, but more general AVX instructions can get you halfway there. Well, maybe a quarter. People point out that AVX uses a lot of power, but they ignore that the same algorithm running instead on more but simpler cores would use even more power.
Maybe I misunderstand you, but there are some fairly non-general ops for encoding/decoding crypto.
Here's an article about JITing x86 to AVX-512 to fuzz 16 VMs per thread:
https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_e...
It matters to image/video/audio processing
It matters to simulations
It matters to 3D models/rendering
It matters to games
So it's not "just benchmarks", people actually want to do stuff with it
Sure, AVX512 might not be the greatest way of doing it, and it might be better to just make the existing instructions go faster, that might work
It seems like they just keep that area mostly empty in processors without that feature, at least for the processors related to the one pictured. I'm not really sure how much effective cache could fit there without a major overhaul, but a chip designer or enthusiast likely would be. This could be why Linus focused on computational enhancement when he discussed transistor budget.
AVX-512's fantastic breadth is born out of an actual need to free compilers from constraints imposed by programs in virtually every mainstream language. All of these describe programs for an academic machine rooted in a scalar instruction model. Without any further performance from increasing clock speeds over time, the target has to become instructions-per-cycle and even operations-per-instruction. The limitations on ILP and the expense of powering circuitry to achieve it have been well studied for the past two decades. The failure to realize it is evident in the failure of Netburst. Linus believes that the front ends of CPUs have a lot more to give; this is perhaps best exhibited by his refutation of CMOV (https://yarchive.net/comp/linux/cmov.html).
Today's programming languages haven't evolved to make it easier for programmers to describe non-scalar code. On the other hand, power constraints, and now security constraints, haven't made it easier for hardware to efficiently execute scalar code. Perhaps AVX-512 is as naive a bet as Itanium; if not, it might be just the missing piece compilers need that they didn't have twenty years ago.
Maybe the tradeoff is somewhere interesting from a latency perspective - SDR or similar. I dunno, am I barking up the wrong tree?
Despite that I'd agree most people probably see no benefit from these units today. But that could change. For workloads with parallelism, wide SIMD is very efficient - more so than multiple threads anyway. The only way to get people to write vector code is to have vector processing available. Once it's ubiquitously available people might code for it and the benefits may become more apparent.
- AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations[3]
- AVX-512 Integer Fused Multiply Add (IFMA) - fused multiply add of integers using 52-bit precision.
could be very useful. I could have done with those recently. They also don't (AFAIK) cause cpu scaling (polite term for downclocking). He may well be right with FP though.
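For anyone wondering what "52-bit precision" means there, here is a scalar C model of one 64-bit lane, based on my understanding of the VPMADD52LUQ / VPMADD52HUQ pair: multiply the low 52 bits of two operands into a 104-bit product, then add either the low or the high 52 bits of it to a 64-bit accumulator. The function names are made up, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

#define MASK52 ((UINT64_C(1) << 52) - 1)

/* Full 104-bit product of the low 52 bits of each operand.
 * unsigned __int128 is a GCC/Clang extension, used here for brevity. */
static unsigned __int128 mul52(uint64_t b, uint64_t c)
{
    return (unsigned __int128)(b & MASK52) * (c & MASK52);
}

/* Accumulate the low 52 bits of the product (one lane). */
uint64_t madd52lo(uint64_t acc, uint64_t b, uint64_t c)
{
    return acc + (uint64_t)(mul52(b, c) & MASK52);
}

/* Accumulate the high 52 bits of the product (one lane). */
uint64_t madd52hi(uint64_t acc, uint64_t b, uint64_t c)
{
    return acc + (uint64_t)(mul52(b, c) >> 52);
}
```

The 12 spare bits at the top of each 64-bit accumulator are what make this handy for big-number arithmetic: you can add up many partial products before having to propagate carries.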
Nine years ago, AMD tested the hypothesis that really more "cores" and higher integer throughput were all that was needed and that FP performance didn't matter. The resulting architecture (Bulldozer) was a near-fatal disaster. It didn't even work out in the datacenter, where you might expect that hypothesis to hold.
Since this thread is of the second freshness, we won't merge.
Because absolutely nobody cares outside of benchmarks."
That was back in the stone age when a lot of applications for FP math weren't mainstream. Most of AVX-512 doesn't even concern FP, there's lots of integer and bit twiddling stuff there.
Furthermore, people really do care about these benchmarks. It influences their purchasing, which is really the thing that matters most to Intel. A lot of people don't actually care about hypothetical security issues or the fact that the CPU is 14nm when it still outperforms 7nm in single-threaded code.
Also, it's not like you can just trade off IPC or extra cores for wider SIMD. It's not like "just add more cores" is just as good for throughput, otherwise GPUs wouldn't exist. Wider SIMD is cheap in terms of die area, for the throughput it gives you.
Lastly, these are just instructions, nothing says that an AVX-512 instruction needs to go through a physical 512-bit wide unit, it just says that you can take advantage of those semantics, if possible.
Today I learned that even Linus Torvalds has a bozo bit. [1] When's the last time he actually did anything with a computer?
> When's the last time he actually did anything with a computer?
According to Linus he completes about 30 pull requests a day. Some multiple of that in kernel builds. His $1900 32 core Threadripper speeds that process a great deal and FP contributes little to nothing.
Today people stream video+audio, encrypt+decrypt and render graphics. All of these have specialized silicon. If their AVX-512 vanished in the night almost no one would notice the next day.
Maybe we should all be astronomers and thermodynamicists writing bespoke finite element simulations and have a deep appreciation for the wonders of floating point ISAs, but that's just not the real world.
As for the rest, you are wrong: AVX-512 is not just floating point by any means, and floating point is used by more than just scientific workloads.
Games/simulations/modeling software etc all can make heavy use of floating point.
Back in the day, CPUs didn't come with FPUs and the latter were optional co-processors.
The idea in the x86-world always was to "outsource" special requirements to dedicated hardware (FP co-processors, GPUs, sound cards, network cards, hardware codec cards, etc.), instead of putting them on the CPU package (like ARM-based SoCs).
So it's different philosophies entirely - tightly integrated SoCs vs versatile and flexible component-based hardware.
It's The One Ring ([ARM-based] SoCs) vs freedom of choice and modularity (PC). If I don't do simulations or 3d-modelling/rendering, I am free to choose a cheap display adapter without powerful 3D-acceleration and choose a better audio interface instead (e.g. for music production).
The SoC approach forces me to buy that fancy AI/ML-accelerator, various video codecs, and powerful graphics hardware with my CPU regardless of my needs, because the benevolent system provider (e.g. Apple) deems it fit for all...
Torvalds is just old-school in that he prefers freedom of choice and the "traditional" PC over highly integrated SoCs.
In the old days there were minor competitors to the x87 family that died quickly. (For reference: https://en.wikipedia.org/wiki/X87#Manufacturers )
For the rest yeah, it kinda makes sense to have them customizable.
Actually the Atari and Amigas were there first, that was PC catching up with their multimedia capabilities.
IIRC when bulldozer was released and Intel's propaganda machine started spewing stories about how AMD core count was fake because two cores shared a FP unit, there was a flurry of scientific papers on the subject.
IIRC, it was determined that even the hot path of FP-intensive code only executed a single FP op for every 7 non-FP operations. To put it differently, between FP ops all code has to execute ops to move data around.
Consequently, bulldozer's FP benchmarks scaled linearly wrt cores: even when multiple cores had to share an FP unit to run FP operations, those operations were so scarce, even in number-crunching applications, that cores didn't block, and thus overall performance was not affected.
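The arithmetic behind that claim can be sketched in a few lines of C. This is a toy estimate with a made-up function name; it assumes each core issues one op per cycle and the shared FP unit retires one FP op per cycle, which are simplifications and not Bulldozer specifics.

```c
/* Toy estimate: fraction of cycles a shared FP unit is busy, assuming each
 * core issues one op per cycle and the FP unit retires one FP op per cycle.
 * Both assumptions are simplifications, not Bulldozer specifics. */
double shared_fp_utilization(int cores, int fp_ops, int non_fp_ops)
{
    /* Share of one core's instruction stream that needs the FP unit. */
    double fp_share = (double)fp_ops / (double)(fp_ops + non_fp_ops);

    /* Demand from all cores sharing the unit; values above 1.0 mean contention. */
    return cores * fp_share;
}
```

With the 1-in-8 ratio quoted above, shared_fp_utilization(2, 1, 7) comes out at 0.25, so under these assumptions two cores would keep one shared FP unit busy only a quarter of the time.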
That's the relevance of FP in real-world benchmarks.
The only thing that Intel has going for their GPUs is that as typically happens with the underdog companies, they decided to play nice with FOSS drivers and with integrated GPUs they own the low budget laptop market.
Everyone that has done any serious 3D programming is painfully aware how bad their OpenGL drivers used to be. They even used to fake OpenGL queries, reporting features as supported when they were actually implemented in software, thus making some games unusable.
That is why they started the campaign about optimizing games for Intel GPUs, and how to make best use of Graphics Profile Analyser, which ironically in the old days was DirectX only.
The bottleneck you mention is only an issue when there isn't any shared memory available. If the hardware allows for unified memory models, then there is no data transfer and the GPU can get to work right away, although naturally some synchronization points still need to happen.
AVX512 in particular has issues. Using it slows down the CPU, so the actual wall clock benefit depends heavily on how it is used.
Of any day's computational workload, only graphics, (parts of) ML, and maybe space heating masquerading as financial innovation are amenable to being run in such a fashion. And those workloads are, as far as I can tell, already being run on GPUs (and similar) almost universally.
So I don't think there actually are major workloads that will shift away from Intel to GPUs in the near future?
Both Zen and Intel lower their clocks under load, especially AVX. Keep in mind that Zen 2 doesn't even reach its advertised boost clocks under any load; some CPUs come within 100 MHz or so, but overall they all clock down rather fast once TMax or PMax is reached.
386, introduced 1985:
http://www.cpu-world.com/CPUs/80386/Intel-A80386-16.html
Typical/Maximum power dissipation: 1.85 Watt / 2.3 Watt
And no Pentium III (1999-2003) needed more than around 30 W:
https://en.wikipedia.org/wiki/List_of_Intel_Pentium_III_micr...
I prefer intrinsics as they give more control than shader languages and they can be written in C++ instead of fiddling with some garbage GPU API that runs async.
Also one of the reasons CUDA won developer love is that it fully embraced polyglot programming on the GPU.
CUDA seems nice, but being Nvidia only makes it a total dead end.
Fugaku is the opposite of that: each CPU chip is 48 cores with 512-bit wide SVE (the Arm counterpart of AVX512).
They deliberately went for something easier to program, that didn't require doing the CPU/GPU dance.
All our laptops have Intel stickers on them and I doubt AMD is winning crazy dollars on cloud deployments.
https://www.phoronix.com/scan.php?page=article&item=amd_fx81...
https://www.phoronix.com/scan.php?page=article&item=amd_fx83...
If Linus' attitude of "I'd rather have more cores" and "FP doesn't really matter" were representative of market demand, you'd have expected Bulldozer to do well at least somewhere, as opposed to nowhere.
That is the point of Linus. He would have preferred to use that increase in transistor count for other things, like more cache.
Skylake is less than 30% cache. However, internally it has a 512-bit bus, thanks to avx-512 - which could be considered suboptimal.
If they were actually getting twice the integer performance per module as Intel was getting per core then it might've been interesting, but being the same or only slightly better when comparing modules to cores wasn't enough to overcome the single thread performance deficit which people still care about a lot.
The Bulldozer really did have a big advantage in integer throughput per dollar, but that does not translate to a 2x speedup in pretty much any benchmark. FP throughput on the other hand shows up a lot.
Thanks for reading.
There's also HIP[1], which can be used as a thin wrapper around CUDA, or with the ROCm backend on AMD platforms. It doesn't yet match CUDA in either breadth of features or maturity, but it's getting closer every day.
I wish all the GPU companies would get together and make a standard based on C++ and stick with it.
In what concerns commercial uses of CUDA, Hollywood doesn't seem to have any problem with it, nor the car manufacturers with Jetson.
Or are you speaking about the 1% Linux users on Steam?
It does look like Intel is supporting it at least, so maybe in the future it will be a good option.