Linus Torvalds on AVX512

140 points by ykm 5 years ago | 121 comments

robocat 5 years ago |

The AVX512 instructions can cause strange global performance downgrades.

“One challenge with AVX-512 is that it can actually _slow down_ your code. It's so power hungry that if you're using it on more than one core it almost immediately incurs significant throttling. Now, if everything you're doing is 512 bits at a time, you're still winning. But if you're interleaving scalar and vector arithmetic, the drop in clock speeds could slow down the scalar code quite substantially.” - 3JPLW and https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...

The processor does not immediately downclock when encountering heavy AVX512 instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Light 512-bit instructions will move the core to a slightly lower clock.

* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).

* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).

As per https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...
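To put rough numbers on the trade-off described above, here is a back-of-the-envelope sketch in Python. It assumes a simple Amdahl-style model in which the downclock applies to the whole instruction mix, scalar parts included. The function names and all figures (an 8x vector speedup, a 20% clock drop) are illustrative assumptions, not measurements of any real CPU.

```python
def relative_runtime(vector_fraction, vector_speedup, clock_ratio):
    """Runtime relative to all-scalar code at full clock (1.0 = no change).

    vector_fraction: share of the original scalar runtime that gets vectorized
    vector_speedup:  per-cycle speedup of that share when using wide SIMD
    clock_ratio:     downclocked frequency / nominal frequency, assumed to
                     apply to the scalar parts as well
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if vector_speedup <= 0.0 or not 0.0 < clock_ratio <= 1.0:
        raise ValueError("vector_speedup must be > 0 and clock_ratio in (0, 1]")
    # Cycles needed, relative to the all-scalar version.
    cycles = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    # Fewer cycles per second means proportionally more wall-clock time.
    return cycles / clock_ratio


def break_even_fraction(vector_speedup, clock_ratio):
    """Vectorizable share at which the downclock exactly cancels the gain.

    A result above 1.0 means vectorizing can never pay off in this model.
    """
    if vector_speedup <= 1.0:
        raise ValueError("vector_speedup must be > 1")
    return (1.0 - clock_ratio) / (1.0 - 1.0 / vector_speedup)
```

With those assumed numbers, code that is only 10% vectorizable ends up about 14% slower overall, and the break-even point sits at roughly 23% vectorizable work.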

MaxBarraclough 5 years ago | |

> The AVX512 instructions can cause strange global performance downgrades.

Can other SIMD instructions (AVX2, say) do the same?

th3typh00n 5 years ago | | |

> Can other SIMD instructions (AVX2, say) do the same?

On Intel CPUs, yes. There's even a BIOS/UEFI setting called "AVX offset" to specify how much you want the clock frequency to drop when running AVX code. AMD CPUs don't do that though, as far as I know.

The thermal hit of using wider vectors decreases with every node shrink though, so expect the issue to become muted over time (which also explains why that doesn't apply to AMD - their only µarch with 256-bit execution units, Zen 2, is on a better node than Intel).

andoriyu 5 years ago | | |

AVX was slowing down some code if input was less than 128 bits wide.

termau 5 years ago | |

wonder if this could be used as a denial of service against a vps host node.

abainbridge 5 years ago |

What are the forces in chip design that are at play here? Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much. As a result, if you fill your modern chip with compute gates, you cannot use them all at once because the chip will melt. Or at least you can't have them all running at max clock rates. One solution is to increase the proportion of the chip used for SRAM (it uses less power per unit area than compute gates), this is what Graphcore have done. Another is to put down multiple different compute blocks, each designed for a different purpose, and only use them a-few-at-a-time. The big-little Arm designs in smartphones are an example of that. But I feel like AVX512 might be an example too. When they add ML accelerator blocks next, they also will not be able to be used flat out at the same time as the rest of the cores' resources.

I'm sure Intel should fix the problems Linus is complaining about, but I feel like chip vendors are being forced into this "add special purpose blocks" approach, as the only way to make their new chips better than their old ones.

tails4e 5 years ago | |

Jim Keller had an interesting talk recently [1] about ways of doing parallel processing to better use the billions of transistors we have, assuming the task is parallelizable. There's the scalar core (i.e. the basic CPU), which is relatively easy to program. Then a scalar core with vector instructions, which is difficult to program efficiently. Then there are arrays of scalar cores, i.e. GPUs, which are relatively easy to program again, and now a lot of startups with arrays of scalar cores each with vector engines, which are expected to be the most difficult to program. He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.

1. https://youtu.be/8eT1jaHmlx8

confuseshrink 5 years ago | | |

Vectorization: I'm not an expert in this area so I can only tell you what I've personally found difficult in dealing with vectorization. Usually it all comes down to alignment and vector lanes. To utilize the vector instructions you basically have to paint your memory into separate (but interleaved) regions that can be mapped to distinct vector lanes efficiently. Everything is fine as long as no two elements from separate lanes have to be mixed in some way; as soon as your computation requires that, you incur a heavy cost.

Dealing with these issues might require you to know the corners of the instruction set really well, or sometimes the solution is outside of the instruction set and is related to how your data structure is laid out in memory, leading you to AoS vs SoA analysis etc.
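A small Python sketch of the AoS vs SoA point, using the standard `array` module as a stand-in for raw memory. The helper names and the x/y/z point layout are assumptions made up for illustration; the only claim is about access patterns, not performance in Python.

```python
from array import array


def make_aos(points):
    """Array of structures: x0, y0, z0, x1, y1, z1, ... in one flat buffer."""
    flat = array("f")
    for x, y, z in points:
        flat.extend((x, y, z))
    return flat


def make_soa(points):
    """Structure of arrays: all x values together, then all y, then all z."""
    return {
        "x": array("f", (p[0] for p in points)),
        "y": array("f", (p[1] for p in points)),
        "z": array("f", (p[2] for p in points)),
    }


def load_x_lanes_aos(flat, first, lanes):
    """Fetch `lanes` x values from AoS: a strided access (every 3rd element)."""
    return list(flat[3 * first : 3 * (first + lanes) : 3])


def load_x_lanes_soa(soa, first, lanes):
    """Fetch `lanes` x values from SoA: one contiguous slice."""
    return list(soa["x"][first : first + lanes])
```

Both loaders return the same values, but only the SoA one reads a contiguous run of memory, which is the shape a plain vector load wants. The strided version corresponds to a gather or to loads followed by shuffles.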

Compilers and vectorization: Based on reading a lot of assembly output I think what compilers usually struggle with are assumptions that the human programmer knows hold for a given piece of code, but the compiler has no right to make. Some of this is basic alignment; gcc and clang have intrinsics for these. Sometimes it's related to the memory model of the programming language disallowing a load or a store at specific points.
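One concrete case of an assumption the compiler cannot make is that input and output do not overlap. The Python sketch below models a scalar loop and a "vectorized" version of the same loop (wide load, then wide store) on a single buffer. The function names and the width of 4 are hypothetical; the point is only that the two orderings disagree when the ranges alias, so a compiler must prove they don't (or be told, e.g. via `restrict` in C) before vectorizing.

```python
def scalar_loop(buf, dst, src, n):
    """out[dst+i] = out[src+i] + 1, one element at a time, in program order."""
    out = list(buf)
    for i in range(n):
        out[dst + i] = out[src + i] + 1
    return out


def vector_loop(buf, dst, src, n, width=4):
    """Same loop, but each step loads `width` elements before storing any."""
    out = list(buf)
    for i in range(0, n, width):
        k = min(width, n - i)
        loaded = out[src + i : src + i + k]                   # one wide load
        out[dst + i : dst + i + k] = [x + 1 for x in loaded]  # one wide store
    return out
```

With `dst = src + 1` the scalar loop feeds each result into the next iteration, while the wide version reads stale values, so the outputs differ. With disjoint ranges they agree.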

GPGPU programmability: GPUs being easy to program is something I take with a grain of salt. Yes, it's easy to get up and running with CUDA. Making an _efficient_ CUDA program, however, is easily as challenging as writing an efficient AVX program, if not more so.

pornel 5 years ago | | |

Here's more on the problem of SIMD and C compilers:

https://pharr.org/matt/blog/2018/04/18/ispc-origins.html#aut...

> as long as vectorization can fail (and it will), […] you must come to deeply understand the auto-vectorizer. […] This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation

reitzensteinm 5 years ago | | |

GPUs aren't really arrays of scalar cores. All threads in a warp run in lock step. If one takes a branch they all do, with operations being masked off as needed.

It's not all that different conceptually to AVX-512 with mask registers, except the vector size is even larger and of course the programming model differs.
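A toy Python model of that lock-step behaviour: every lane evaluates both sides of the branch, and a per-lane mask picks which result is kept. The function name is made up for illustration, and real hardware can skip a side entirely when no lane needs it, which this sketch does not model.

```python
def lockstep_abs_diff(a, b):
    """Compute |a[i] - b[i]| the way a masked SIMD unit or a warp would."""
    mask = [x >= y for x, y in zip(a, b)]       # per-lane branch condition
    taken = [x - y for x, y in zip(a, b)]       # "if" side, run on every lane
    not_taken = [y - x for x, y in zip(a, b)]   # "else" side, run on every lane
    # The mask selects, lane by lane, which of the two results survives.
    return [t if m else n for m, t, n in zip(mask, taken, not_taken)]
```

The scalar version would be a plain `if x >= y` per element. Here the branch has turned into data (the mask), which is conceptually what both AVX-512 mask registers and GPU divergence handling do.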

abainbridge 5 years ago | | |

> He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.

I have a simplistic explanation - maybe not what you're looking for but it is the best I can do...

At 12m23s in the video he says, "If you're working in a layer and the layers are well constructed (abstracted) you really can make a lot of progress. But if the top layer says, 'to make this really fast, go change the bottom layer', then it's going to get all tangled up."

That's what implementing an algorithm on a SIMD architecture feels like to me. I have to figure out a way of filling my SIMD width with data each clock cycle, while in contrast, the specification of the algorithm deals with data one piece at a time.

Take insertion sort as a (bad) example.

    i ← 1
    while i < length(A)
        j ← i
        while j > 0 and A[j-1] > A[j]
            swap A[j] and A[j-1]
            j ← j - 1
        end while
        i ← i + 1
    end while

That algorithm cannot easily take advantage of SIMD. You have to change the algorithm to make it work with the architecture.

We'd probably say the algorithm is the top level of the abstraction stack, and the SIMD architecture is a level near the bottom. So this problem is the opposite way around to how Jim phrased it, but the point is that we have NOT got clean abstraction - an implementation in one layer depends on the implementation in another.
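As an illustration of "changing the algorithm", here is a Python sketch of odd-even transposition sort, swapped in as one example of a SIMD-friendlier sort (it is not something from the talk). Within each phase every compare-exchange touches a disjoint pair, so all of them can be done at once as a lane-wise min and max. The list comprehensions stand in for those vector operations.

```python
def odd_even_transposition_sort(values):
    """Sort by n phases of independent neighbour compare-exchanges."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        start = phase % 2              # even phases: (0,1),(2,3)...; odd: (1,2),(3,4)...
        lo = a[start : n - 1 : 2]      # left element of every pair
        hi = a[start + 1 : n : 2]      # right element of every pair
        # These two lines are the data-parallel part: no pair depends on another.
        mins = [min(x, y) for x, y in zip(lo, hi)]
        maxs = [max(x, y) for x, y in zip(lo, hi)]
        a[start : n - 1 : 2] = mins
        a[start + 1 : n : 2] = maxs
    return a
```

Compared with the insertion sort above, the inner work no longer depends on the result of the previous comparison, at the price of doing O(n^2) compare-exchanges unconditionally. That is the kind of trade the lower layer forces on the upper one.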

zozbot234 5 years ago | | |

Are GPUs really easier to program than scalar w/ SIMD (or vector insns)? The programming models you have to work with for GPGPU seem quite obscure, whereas with CPU and SIMD flipping a compiler switch gets you most of the way there, and self-contained intrinsics do the rest.

gnufx 5 years ago | | |

At least part of the problem is that computing mostly depends on moving data. Memory bandwidth is relatively low, so it's difficult to get enough actual floating point intensity, at least for "large" arrays, even when it's theoretically available. A classic example is GEMM (general matrix multiplication), where you should expect a good implementation to get around 90% of peak performance, but also expect it to jump through various tricky hoops to get there. With, say, vector multiplication the hoops aren't available, and you're ultimately memory-bound. Yes, there's more to it than that, and SIMD has non-FP applications etc.
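The GEMM vs vector multiplication contrast can be made concrete with a roofline-style estimate. This Python sketch assumes 8-byte elements, idealized memory traffic (each matrix crosses the memory bus exactly once, which is what the "hoops" are for), and a hypothetical machine with 1000 GFLOP/s peak and 100 GB/s of bandwidth. None of these figures come from the thread.

```python
def vector_multiply_intensity(bytes_per_element=8):
    """FLOPs per byte for c[i] = a[i] * b[i]: 1 flop, 2 loads and 1 store."""
    return 1.0 / (3 * bytes_per_element)


def gemm_intensity(n, bytes_per_element=8):
    """FLOPs per byte for an n x n GEMM with ideal cache blocking.

    2*n^3 flops; A and B are read once and C is written once (3*n^2 elements).
    """
    flops = 2.0 * n ** 3
    bytes_moved = 3.0 * n ** 2 * bytes_per_element
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Roofline bound: limited by compute peak or by memory bandwidth."""
    return min(peak_gflops, intensity * bandwidth_gb_s)
```

On that assumed machine, element-wise multiplication tops out near 4 GFLOP/s whatever the SIMD width, while a large enough GEMM can in principle reach the compute peak because its intensity grows with n.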

amelius 5 years ago | | |

How does this solve the power problem that GP is talking about?

michaelt 5 years ago | |

> Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much.

Kids these days get 8 cores for a 100W TDP.

When I was a boy, 100W got you a single core. And you didn't get dynamic frequency scaling, so it'd be putting out that heat all the time.

(We also had to walk to school barefoot in the snow, uphill both ways)

em500 5 years ago | | |

You must be young. Home PC CPUs from my youth drew only single digit watts. They didn't require any fan until the Pentium.

noisem4ker 5 years ago | | |

>putting out that heat all the time

Even if the frequency was fixed, dissipated heat did definitely vary together with the computing load.

blaser-waffle 5 years ago | | |

> so it'd be putting out that heat all the time

What's the problem? My old school pentiums kept my dorm room nice and toasty. Could keep my window cracked in the winter for fresh air while gentoo compiled...

msh 5 years ago | | |

Could you not use TDP to melt the snow ;)

pjmlp 5 years ago | |

The main problem is software: with GPGPUs you need to explicitly program for them, while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.

Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.

teruakohatu 5 years ago | | |

> the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.

With how AVX512 is implemented, there isn't much point in a compiler auto optimizing general purpose code to use it, because even if there is a theoretical speedup, it may well be slower in practice.

throwaway_pdp09 5 years ago | | |

> while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms

No. I recently could really, really have used the packed saturated integer arithmetic and horizontal addition in AVX2 (but my old machine doesn't support it) and even better, the same but 512 bits wide on AVX512. It would only have been 6 or 7 instructions, if that, but it was inner loop, and mattered. Using compiler intrinsics would have been fine. I think you're looking at things too narrowly.
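For readers who haven't met these operations, here is a Python sketch of what "saturated" and "horizontal" mean per lane. It models signed 16-bit lanes and a plain adjacent-pair sum; it does not reproduce the exact operand interleaving of any particular AVX2 instruction, and the function names are made up for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturating_add_i16(a, b):
    """Lane-wise add that clamps to the int16 range instead of wrapping."""
    return [max(INT16_MIN, min(INT16_MAX, x + y)) for x, y in zip(a, b)]


def pairwise_sum(v):
    """Horizontal step: add each adjacent pair of lanes, halving the length."""
    return [v[i] + v[i + 1] for i in range(0, len(v) - 1, 2)]
```

In scalar code the clamp costs a compare and branch (or two min/max operations) per element, which is why having it as a single packed instruction matters in an inner loop.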

viraptor 5 years ago | | |

In my ideal world you'd be able to mark a function "this should compile to / run on gpgpu" and the compiler would potentially tell you why it can't do that. I'm not even sure if anything is stopping us from implementing that apart from the effort required. Sure, many ways to write that code will result in terrible performance, but it would still be closer to the auto-vectorisation experience.

Actually we already have openmp to cuda (http://www2.engr.arizona.edu/~ece569a/Readings/GPU_Papers/3....) so just making it more production-ready would be perfect.

zozbot234 5 years ago | | |

> Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.

I think you got this backwards - the lack of developers' interest is what leads to the mistaken impression that GPU compute is only good for multimedia and FP-crunching workloads. Even looking at the success of GPU compute in mining cryptocoins (only ASIC's do better) ought to be enough to tell you that we could do a lot more with them if we cared to.

ksec 5 years ago | |

>but haven't reduced the power consumption per gate as much.

That is simply not true. You can run the 64 Core on EPYC 2 all at once at 3Ghz all with Air Cooling.

At every node they have reduced power consumption; that is also one reason you see continuous performance improvement.

abainbridge 5 years ago | | |

> That is simply not true.

I'm not claiming anything controversial. Power not having scaled as well as area recently is often referred to as the end of Dennard scaling:

https://en.wikipedia.org/wiki/Dennard_scaling#Breakdown_of_D...

> You can run the 64 Core on EPYC 2 all at once at 3Ghz all with Air Cooling.

That can be true despite the fact that power hasn't scaled as well as area.

> At every node they have reduced power consumption

Yep, just not as much as they improved area.

bsder 5 years ago | |

> What are the forces in chip design that are at play here?

The "weak form" of Moore's Law--"Performance doubles every 12-18 months"--is dead and buried.

The "strong form" of Moore's Law is still active--"Transistor cost halves every 12-18 months".

This means that you can't make the primary paths any faster. So, all you can do is add functionality and pray that someone magically can make that functionality relevant to the primary use cases.

gridlockd 5 years ago | |

AVX is not a "special purpose block", it's Intel's answer to not adding special purpose blocks on customer demand, like you can do with ARM.

Crypto or video decoding comes to mind, those would be much faster with dedicated silicon, but more general AVX instructions can get you halfway there. Well, maybe a quarter. People point out that AVX uses a lot of power, but they ignore that the same algorithm running instead on more but simpler cores would use even more power.

throwaway_pdp09 5 years ago | | |

> but more general AVX instructions can get you halfway there

Maybe I misunderstand you, but there are some fairly non-general ops for encoding/decoding crypto

https://en.wikipedia.org/wiki/AVX-512#VAES

floatboth 5 years ago |

I agree that there's too much focus on FP, but SIMD is not all about FP. Every new SIMD ISA extension has something interesting for integer.

Here's an article about JITing x86 to AVX-512 to fuzz 16 VMs per thread:

https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_e...

raverbashing 5 years ago |

FP matters (especially with SIMD)

It matters to image/video/audio processing

It matters to simulations

It matters to 3D models/rendering

It matters to games

So it's not "just benchmarks", people actually want to do stuff with it

Sure, AVX512 might not be the greatest way of doing it, and it might be better to just make the existing instructions go faster, that might work

rbanffy 5 years ago |

I for one would be delighted by having more caches or wider backends instead of AVX512, but I don't want SIMD to be pushed into GPUs. It'd be better to do the reverse - to push forward the asymmetric core idea and move more GPU functionality into lots of simpler cores tuned for SIMD at the cost of single thread performance.

molticrystal 5 years ago | |

Here are some shots of the Mask Registers https://travisdowns.github.io/blog/2020/05/26/kreg2.html#the...

It seems like they just keep that area mostly empty in processors without that feature, at least for the processors related to the one pictured. Not really sure how much useful cache could fit there without a major overhaul, but likely a chip designer or enthusiast would know. This could be why Linus focused on computational enhancement when he discussed transistor budget.

rbanffy 5 years ago | | |

From a quick glance at the proportions and considering not only the register files are halved, but also the vector EUs, I'd expect a 25% increase in L3 or a 50% in L2. That and some lessened thermal constraints.

throwaway_pdp09 5 years ago | |

I really don't know if that would help much. Better cache management might give more bonus than just bigger caches or higher bandwidth.

rbanffy 5 years ago | | |

It depends on your workload, but if you are wasting too much time with L3 misses, more cache (and more memory channels) is a good idea.

TazeTSchnitzel 5 years ago | |

So, Larrabee?

protomyth 5 years ago | | |

Cell with a saner bus/memory access?

jasonzemos 5 years ago |

AVX-512's richness to x86 is like what C++'s is to C. Linus makes a summary assessment for how he can leverage these technologies to his advantage and if the cost of learning the technology and all its intricacies outweighs the perceived advantage: that technology is garbage. This reaction from Linus appears to fit his conservative pattern. I think where Linus gets things wrong stems from his facts rather than his philosophy.

AVX-512's fantastic breadth is born out of an actual need to free compilers from constraints imposed by programs in virtually every mainstream language. All of these describe programs for an academic-machine rooted in a scalar instruction model. Without any further performance from increasing cycles over time the target has to become instructions-per-cycle and even operations-per-instruction. The limitations on ILP and the expense of powering circuitry to achieve it have been well studied for the past two decades. The failure to realize it is evident in the failure of Netburst. Linus believes that the front ends of CPUs have a lot more to give; perhaps best exhibited with his refutation of CMOV (https://yarchive.net/comp/linux/cmov.html).

Today's programming languages haven't evolved to make it easier for programmers to describe non-scalar code. On the other hand, power constraints, and now security constraints, haven't made it easier for hardware to efficiently execute scalar code. Perhaps AVX-512 is as naive a bet as Itanium; if not, it might be just the missing piece compilers need that they didn't have twenty years ago.

RantyDave 5 years ago |

Are Intel just delaying the inevitable? Is it safe to say (even today) that a slow GPU will crunch big matrices faster than a fast CPU? And that's before we get to price/performance. So all that's left is the bottleneck around PCIe which, in theory, leaves the CPU with an advantage only for small datasets - which we don't really care about anyway (because they happen quickly).

Maybe the tradeoff is somewhere interesting from a latency perspective - SDR or similar. I dunno, am I barking up the wrong tree?

fancyfredbot 5 years ago |

AVX512 is both integer and floating point, not just FP, so this rant about FP comes across as ill informed.

Despite that I'd agree most people probably see no benefit from these units today. But that could change. For workloads with parallelism, wide SIMD is very efficient - more so than multiple threads anyway. The only way to get people to write vector code is to have vector processing available. Once it's ubiquitously available people might code for it and the benefits may become more apparent.

throwaway_pdp09 5 years ago |

The very wide AVX stuff with integer ops, like these from wiki:

- AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations[3]

- AVX-512 Integer Fused Multiply Add (IFMA) - fused multiply add of integers using 52-bit precision.

could be very useful. I could have done with those recently. They also don't (AFAIK) cause cpu scaling (polite term for downclocking). He may well be right with FP though.

superjan 5 years ago | |

52 bit precision? typo?

sdflhasjd 5 years ago | | |

52 bits is the size of the mantissa in an IEEE 754 double precision floating point number

throwaway_pdp09 5 years ago | | |

Well caught. But https://www.felixcloutier.com/x86/vpmadd52luq

kzrdude 5 years ago | | |

Sounds suspiciously like integers in the float mantissa
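For what it's worth, the semantics on the linked page can be modelled in a few lines of Python. This is a sketch of a single 64-bit lane as I read the description of VPMADD52LUQ and its high-half sibling VPMADD52HUQ; the function names are made up for illustration and this is not a substitute for the reference.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """acc + low 52 bits of (a[51:0] * b[51:0]), wrapping at 64 bits."""
    product = (a & MASK52) * (b & MASK52)   # up to 104 bits wide
    return (acc + (product & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """acc + high 52 bits of (a[51:0] * b[51:0]), wrapping at 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64
```

The two halves together give the full 104-bit product, which is what makes the pair useful for big-integer arithmetic with 52-bit limbs. Bits 52 to 63 of each input are simply ignored, consistent with the guess that the multiplier is sized for a double's mantissa.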

gridlockd 5 years ago | |

If he was right with FP, he'd know better than the business analysts at Intel. Instead, his opinion is based on what the market looked like thirty years ago.

Nine years ago, AMD tested the hypothesis that really more "cores" and higher integer throughput were all that was needed and that FP performance didn't matter. The resulting architecture (Bulldozer) was a near-fatal disaster. It didn't even work out in the datacenter, where you might expect that hypothesis to hold.

throwaway_pdp09 5 years ago | | |

AMD is currently giving intel great pain. So much for business analysts at Intel.

dang 5 years ago |

A related discussion is here: https://news.ycombinator.com/item?id=23822203, also with interesting comments.

Since this thread is of the second freshness, we won't merge.

bartwe 5 years ago |

Down with simd, up with spmd/compute

nullc 5 years ago |

There are 1001 AVX512 variations, but few equivalent operations to the RISC-V bit manipulation instructions.

gridlockd 5 years ago |

"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota.

Because absolutely nobody cares outside of benchmarks."

That was back in the stone age when a lot of applications for FP math weren't mainstream. Most of AVX-512 doesn't even concern FP, there's lots of integer and bit twiddling stuff there.

Furthermore, people really do care about these benchmarks. It influences their purchasing, which is really the thing that matters most to Intel. A lot of people don't actually care about hypothetical security issues or the fact that the CPU is 14nm when it still outperforms 7nm in single-threaded code.

Also, it's not like you can just trade off IPC or extra cores for wider SIMD. It's not like "just add more cores" is just as good for throughput, otherwise GPUs wouldn't exist. Wider SIMD is cheap in terms of die area, for the throughput it gives you.

Lastly, these are just instructions, nothing says that an AVX-512 instruction needs to go through a physical 512-bit wide unit, it just says that you can take advantage of those semantics, if possible.
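A toy Python model of that last point: a 16-lane add (512 bits of 32-bit lanes) carried out as two passes over an 8-lane unit. The function name and lane counts are assumptions for illustration and describe no specific CPU; the result is the same as a single 16-lane operation, only the number of passes differs.

```python
def add_512_via_256(a, b, lanes_per_pass=8):
    """Lane-wise 32-bit add of two vectors, executed in narrower passes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    result = []
    for start in range(0, len(a), lanes_per_pass):
        chunk_a = a[start : start + lanes_per_pass]
        chunk_b = b[start : start + lanes_per_pass]
        # Each pass handles one half-width slice; lanes wrap at 32 bits.
        result.extend((x + y) & 0xFFFFFFFF for x, y in zip(chunk_a, chunk_b))
    return result
```

Software sees one instruction either way. The narrower implementation trades throughput for die area and power.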

CamperBob2 5 years ago |

> Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks.

Today I learned that even Linus Torvalds has a bozo bit. [1] When's the last time he actually did anything with a computer?

1: https://en.wikipedia.org/wiki/Bozo_bit

topspin 5 years ago | |

He has the history correct. Most of the CPUs that x86 beat in the market had superior FP performance; SPARCs, Alphas, PA-RISC, Itanium, etc.

> When's the last time he actually did anything with a computer?

According to Linus he completes about 30 pull requests a day. Some multiple of that in kernel builds. His $1900 32 core Threadripper speeds that process a great deal and FP contributes little to nothing.

Today people stream video+audio, encrypt+decrypt and render graphics. All of these have specialized silicon. If their AVX-512 vanished in the night almost no one would notice the next day.

Maybe we should all be astronomers and thermodynamicists writing bespoke finite element simulations and have a deep appreciation for the wonders of floating point ISAs, but that's just not the real world.

azalemeth 5 years ago | | |

Speaking as someone who does scientific computing all day long, in part with FEM simulations, even for me AVX512 isn't usually worth it in terms of wall-clock time.

TinkersW 5 years ago | | |

They wouldn't notice AVX512 vanishing because they never had it in the first place, as Intel hasn't shipped it in the CPUs people actually use for those tasks, just servers and random laptops.

As for the rest, you are wrong, AVX/512 is not just floating point by any means, and floating point is used by more than just scientific workloads.

Games/simulations/modeling software etc all can make heavy use of floating point.