GPU advancements in M3 and A17 Pro [video](developer.apple.com)

187 points by bhj 2 years ago | 142 comments

pjmlp 2 years ago |

There are also related videos:

"Discover new Metal profiling tools for M3 and A17 Pro"

https://developer.apple.com/videos/play/tech-talks/111374/

"Learn performance best practices for Metal shaders"

https://developer.apple.com/videos/play/tech-talks/111373/

"Bring your high-end game to iPhone 15 Pro"

https://developer.apple.com/videos/play/tech-talks/111372/

frogblast 2 years ago |

If you're interested in more background on one user-visible problem that this new GPU architecture directly attacks, look at "shader compilation stutter" (although there are many others).

These are two excellent posts that go deep on this:

The Shader Permutation Problem - Part 1: How Did We Get Here?

The Shader Permutation Problem - Part 2: How Do We Fix It?

In particular, the second post has the line:

  We probably should not expect any magic workarounds for static register allocation: if a callable shader requires many registers, we can likely expect for the occupancy of the entire batch to suffer. It’s possible that GPUs could diverge from this model in the future, but that could come with all kinds of potential pitfalls (it’s not like they’re going to start spilling to a stack when executing thousands of pixel shader waves).

... And some kind of 'magic workaround for static register allocation' is pretty much what has been done.

https://therealmjp.github.io/posts/shader-permutations-part1...

https://therealmjp.github.io/posts/shader-permutations-part2...
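To make the occupancy point in the quoted line concrete, here is a minimal sketch of how static register allocation limits the number of resident waves. The register file size, wave width, wave limit, and per-path register counts are all illustrative assumptions, not figures from the posts or from any specific GPU.

```python
# Illustrative numbers only; not taken from any specific GPU.
REGISTER_FILE_SIZE = 65536   # 32-bit registers per shader core (assumed)
THREADS_PER_WAVE = 32        # threads per wave (assumed)
MAX_WAVES = 64               # scheduler limit on resident waves (assumed)


def resident_waves(regs_per_thread: int) -> int:
    """Waves that fit when every thread reserves regs_per_thread up front."""
    by_registers = REGISTER_FILE_SIZE // (regs_per_thread * THREADS_PER_WAVE)
    return min(MAX_WAVES, by_registers)


def static_allocation(path_costs: list[int]) -> int:
    """Static allocation must reserve for the most expensive path,
    even if that path is rarely taken at run time."""
    return resident_waves(max(path_costs))


# Hypothetical shader: two cheap paths and one expensive callable path.
paths = [24, 32, 128]
print(static_allocation(paths))  # waves resident with the expensive path compiled in
print(resident_waves(32))        # waves resident if only the cheap paths existed
```

Under these assumed numbers the expensive path cuts the resident waves to a quarter, which is the "occupancy of the entire batch" suffering that the quote describes.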

zmmmmm 2 years ago |

Does Apple document exactly how many actual true cores there are inside their GPUs? It is always confusing when they say "40 core GPU", but I assume these are shader cores, each of which can execute (per the video) "many thousands" of parallel execution paths.

So how does one translate to an equivalent in "CUDA cores" type terminology?

pavlov 2 years ago | |

What Apple calls a GPU core seems to be roughly the same as what Nvidia calls a “streaming multiprocessor”.

For example, a 1080 GTX GPU has 20 streaming multiprocessors (SMs), each containing 128 cores, each of which supports 16 threads.

Meanwhile Apple describes the M1 GPU as having 8 cores, where “each core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs). In total, the M1 GPU contains up to 128 Execution units or 1024 ALUs, which Apple says can execute up to 24,576 threads simultaneously and which have a maximum floating point (FP32) performance of 2.6 TFLOPs.”

So one option to get a single number for a rough comparison is to count threads. The 1080 GTX supports 40,960 threads while the M1 supports 24,576 threads.

There’s obviously a lot more to a GPU — for starters, varying clock speeds, ALUs can have different capabilities, memory bandwidth, etc. But at least counting threads gives a better idea of the processing bandwidth than talking about cores.
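The thread-count comparison above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this comment; the variable names are made up for illustration.

```python
# Figures as quoted in the comment above.
gtx1080_sms = 20
cores_per_sm = 128
threads_per_core = 16
gtx1080_threads = gtx1080_sms * cores_per_sm * threads_per_core

m1_gpu_cores = 8
m1_alus = 1024
m1_threads = 24_576  # Apple's quoted maximum

print(gtx1080_threads)             # total threads on the 1080 GTX
print(m1_threads // m1_gpu_cores)  # threads per M1 GPU core
print(m1_threads // m1_alus)       # threads per M1 ALU
```

By this count the M1 keeps 24 threads in flight per ALU versus 16 per core on the 1080 GTX, which says more about latency hiding than about raw throughput.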

KeplerBoy 2 years ago | | |

Just for clarification: The 1080 has 20 SMs with 128 FPUs each. Each FPU can perform 2 FLOPs per cycle (fused multiply adds). Combined with the frequency of 1607 MHz we land on the advertised 8.2 TFlop/s.

The fact that each SM can support 1024 threads (that's the maximum blocksize of CUDA on that card) doesn't do much for the theoretical flops. Only a fraction of those threads can be active at a time. The others are idling or waiting on their memory requests. This hides a lot of the i/o latency.
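The peak-throughput arithmetic above, written out as a small sketch. The function name and parameters are hypothetical; the inputs are the numbers given in this comment.

```python
def peak_tflops(units: int, fpus_per_unit: int, clock_hz: float,
                flops_per_fpu_per_cycle: int = 2) -> float:
    """Theoretical FP32 peak: every FPU issues one fused multiply-add
    (counted as 2 FLOPs) on every cycle."""
    total = units * fpus_per_unit * flops_per_fpu_per_cycle * clock_hz
    return total / 1e12


# 20 SMs x 128 FPUs x 2 FLOPs x 1607 MHz
gtx1080 = peak_tflops(units=20, fpus_per_unit=128, clock_hz=1607e6)
print(f"{gtx1080:.2f} TFLOP/s")
```

Note that the resident thread count does not appear anywhere in the formula, which is the point being made: extra threads hide latency but do not raise the theoretical peak.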

thaanpaa 2 years ago | | |

If the architecture is vastly different, these comparisons become sort of meaningless, though. The ultimate performance is determined by all the tiny little bottlenecks, like how quickly the architecture can move data between different types of cores, memory, cache, etc.

Apple has always been really good at parallelism, which is why they get so much performance from less power consumption.

cm2187 2 years ago | | |

Question from a GPU novice. I presume a thread is the individual calculation performed on one element of the vector? Can the 128 cores, or 20 SMs, perform different operations at the same time, or do all 24,576 threads perform the same operation at the same time, on a vector of data of length 24,576?

jeffybefffy519 2 years ago | | |

Even more so, we shouldn't assume they have similar architectural layouts.

cmovq 2 years ago | |

I think a better comparison is to look at floating point performance. For example, the 10 core M2 GPU does 3.6 TFLOPS (FP32), while an RTX 4060 does 15 TFLOPS and an RTX 4090 does 82.58 TFLOPS.

reroute22 2 years ago | | |

Unfortunately, hardly. Ampere's (Nvidia 3000 series), Ada's (Nvidia 4000 series), and RDNA 3's (AMD 7000 series) GPUs have doubled up their FP32 units in ways that differ in implementation (between AMD and Nvidia) but are relatively similarly poor in their ability to be utilized properly at rates much higher than pre-doubling (Nvidia is doing better than AMD in that, but very far from great).

The formal TFLOPS comparison as a result would be most sensible between pre-M3 designs, AMD 6000 series (RDNA 2), and Nvidia's 2000 series (Turing). After that it gets really murky with AMD's "TFLOPS" looking nearly 2x more than they are actually worth by the standards of prior architectures, followed by Nvidia (some coefficient lower than 2, but still high), followed by M3 which from the looks of it is basically 1.0x on this scale, so long as we're talking FP32 TFLOPS specifically as those are formally defined.

You can see this effect the easiest by comparing perf & TFLOPS of AMD 6000 series and Nvidia 3000 series - they have released nearly at the same time, but AMD 6000 is one gen before the "near-fake-doubling", while Nvidia's 3000 series is the first gen with the "close-to-fake-doubling": with a little effort you'll find GPUs between these two that perform very similarly (and have very similar DRAM bandwidth), but Ampere's counterpart has almost 2x the FP32 TFLOPS.

YetAnotherNick 2 years ago | | |

While FP32 non-tensor FLOPS at least look comparable, FP16/BF16 with tensor cores (nowadays the default for any neural network, including LLMs) at 330 TFLOPS blows away the M2.

wmf 2 years ago | |

Each Apple core (heh) has 128 FPUs so 40 cores would be akin to 5120 CUDA "cores".
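As a rough sanity check of that figure, the sketch below combines the 128 FPUs per core stated here with the 3.6 TFLOPS quoted upthread for the 10 core M2. It assumes each FPU retires one fused multiply-add (2 FLOPs) per cycle, as in the Nvidia calculation upthread; the implied clock is a derived estimate, not a published specification.

```python
FPUS_PER_APPLE_CORE = 128   # from the comment above
FLOPS_PER_FPU_PER_CYCLE = 2 # one FMA per cycle (assumption)

# "CUDA core" style count for a 40 core GPU.
cuda_like_cores = 40 * FPUS_PER_APPLE_CORE
print(cuda_like_cores)

# Clock implied by the 3.6 TFLOPS quoted upthread for the 10 core M2.
m2_cores = 10
m2_tflops = 3.6
implied_clock_hz = (m2_tflops * 1e12) / (
    m2_cores * FPUS_PER_APPLE_CORE * FLOPS_PER_FPU_PER_CYCLE
)
print(f"{implied_clock_hz / 1e9:.2f} GHz")
```

An implied clock of roughly 1.4 GHz is plausible for a GPU, so the 128 FPUs per core figure is at least consistent with the quoted TFLOPS.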

pixelpoet 2 years ago | | |

Not quite, because Nvidia counts dual issue as a flat doubling of "core" (which previously you could accurately call a vector lane) count.

kergonath 2 years ago | |

> So how does one translate to an equivalent in "CUDA cores" type terminology?

I don’t really think we can, even if we knew exactly what is in an M3 GPU core, which we don’t. Both architectures are very different, and different again from AMD GPUs. We have to count TFLOPS.

dontlaugh 2 years ago | |

Nvidia "cheats" by counting, approximately, ALUs.

gary_0 2 years ago |

I skimmed the video but a lot of it sounded more like advertising than technical information to me. On the other hand, I'm looking forward to watching the Asahi folks crack this stuff open.

runeks 2 years ago |

> I'm excited to tell you about the new Apple family 9 GPU architecture in A17 Pro and the M3 family of chips, which are at the heart of iPhone 15 Pro and the new Max.

"The new Max"? He clearly meant "the new Macs".

Kinda weird that Apple can't properly transcribe its own content.

Tijdreiziger 2 years ago | |

I’ve also noticed this on lots of YouTube videos, where the creator clearly meant one thing, but the subtitles substitute a more common, similarly-sounding word with a different meaning.

I suspect they have the videos transcribed externally, and don’t check the transcription (or only do so in a cursory manner).

TheCapeGreek 2 years ago | | |

Or automated transcription.

For YT vids, especially shorts, it's because churning out shorts/reels/tiktoks of clips from longer form videos (and/or with the split screen gameplay of some mobile game/minecraft platforming run) is now a common tactic for trying to gain tons of views on your account for monetisation later.

eviks 2 years ago | |

Why is it surprising? It's not like content ownership gives you any advantage in the typical transcription algorithms.

runeks 2 years ago | | |

It’s surprising since I would expect Apple to check whatever they get back from a transcription service.

Reason077 2 years ago |

I’m always impressed with the speech synthesis that Apple uses to make the voiceovers in these videos. Some of them almost sound like real people!

stingraycharles 2 years ago | |

The guy introduced himself by name, I was really confused for a while if it was just a human trained to sound like an AI, or an AI trained to sound like a human.

Reason077 2 years ago | | |

Perhaps they ask the AI speech/content generator to give itself a name as part of its training prompt.

SushiHippie 2 years ago | |

They even created a Twitter profile for this AI persona

https://nitter.net/jhaberstro

franzb 2 years ago | |

From the variety of intonations based on context, I doubt it’s speech synthesis.

simbolit 2 years ago | | |

I think parent comment is being ironic. Not sure tho.

TradingPlaces 2 years ago |

Media is so focused on CPUs, they are missing the fact that Apple focused on the GPU and Neural Engine for this round of chips

wincy 2 years ago | |

It was wild seeing Linus Tech Tips demoing resident evil village on the iPhone 15 Pro.

rsynnott 2 years ago | |

This is unfortunately inevitable; CPUs are just so much easier to benchmark in a broadly useful way. And the extreme leakiness of geekbench is helpful (I suspect Apple sees this as a feature; most recent Apple chip iterations have leaked on geekbench)

reroute22 2 years ago |

Okay, I went through the other video they reference ("Discover new Metal profiling tools for M3 and A17 Pro" [1]), and there is actually a whole bunch of extra very relevant (IMO) information on the subject, starting about 13:30 or so.

[1] https://developer.apple.com/videos/play/tech-talks/111374?ti...

w10-1 2 years ago |

The video is just enough of a peek into the GPUs to encourage people to write using Metal APIs (and by the way, use the new APIs and FP16).

jeffybefffy519 2 years ago | |

They should just support DirectX. Devs will never support two graphics APIs. It costs too much, especially to grab the marginal macOS share that has powerful enough GPUs. I'd bet in 4 years Apple moves to DirectX.

flohofwoe 2 years ago | | |

> I'd bet in 4 years Apple moves to DirectX.

Not going to happen, what's more likely is a Proton-like layer above macOS APIs to simplify porting games over. Also see "Game Porting Toolkit" here: https://developer.apple.com/games/

pjmlp 2 years ago | | |

Devs have been supporting multiple APIs for as long as graphics programming has existed; it is only in the FOSS world that it keeps being touted as a pain point.

nemothekid 2 years ago | | |

DirectX is exclusive to the Windows platform. At this point, it's probably deeply tied into Windows. I don't see how you can make that bet.

aurareturn 2 years ago | | |

DirectX is closed source. Also, there are more games on Metal than DirectX.

sccxy 2 years ago |

Does M3 still output garbage which makes external displays flicker?

https://forums.macrumors.com/threads/m1-m2-flickering-ghosti...

https://www.benq.com/en-us/knowledge-center/knowledge/how-to...

https://www.howtogeek.com/805459/mac-flickering-external-scr...

monocasa 2 years ago | |

Interestingly, that's a different component than the GPU on these chips, which is the typical architecture in SoCs.

In fact, even in discrete GPUs, the display scanout engine is generally a nearly completely independent block relative to the rest of the GPU.

lwkl 2 years ago | |

This issue can be fixed by switching the display to RGB [1]. So I think it’s a software bug, but it's really annoying since the fix sometimes resets and the bug only occurs when there is a lot of black on the screen.

[1] https://gist.github.com/GetVladimir/c89a26df1806001543bef4c8...

sccxy 2 years ago | | |

I have two monitors connected to an M1 Mac.

One works perfectly fine and is automatically RGB. The other flickers, and when changed to RGB mode it is lime green.

I wonder why it is not a problem with Intel Macs, and whether M3 fixes those bugs?

Maybe it is Apple's feature to sell more of their own monitors.

They make sure other high-end brands do not work with macOS.

It does not make sense that this kind of bug has been active for 3 years.

leloctai 2 years ago |

Does the complex block in the diagram refer to complex numbers? That doesn't sound typical, does it? What type of workload that typically runs on the GPU would require complex numbers?

reroute22 2 years ago | |

Judging by the output/GUI of their GPU profiler, "complex" there is more like "complex instructions", think f32 (floating point) ops that aren't additions and multiplications (and FMAs), but trigonometry, square roots, that sort of thing.

mattsan 2 years ago | |

FFT plus some game stuff requires complex numbers to do partial rendering (e.g. do some now and then do more next frame - I've lost the link to the talk but IIRC EA did a talk on how they made a shader that emulates lights in the background that are out of focus (not Gaussian but the actual cool effect as if it was a real camera))

Arelius 2 years ago | | |

bokeh?

edit: Found the article: https://www.ea.com/frostbite/news/circular-separable-convolu...

make3 2 years ago | |

quaternions? for rotations
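For context on that suggestion, a minimal sketch of rotating a 3D vector with a unit quaternion. Nothing here comes from Apple's GPU or the video; the function names are hypothetical and the axis is assumed to be unit length.

```python
import math


def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


def rotate(v, axis, angle):
    """Rotate vector v about a unit-length axis by angle (radians)."""
    half = angle / 2.0
    s = math.sin(half)
    q = (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)
    q_conj = (q[0], -q[1], -q[2], -q[3])  # inverse of a unit quaternion
    _, x, y, z = qmul(qmul(q, (0.0, *v)), q_conj)
    return (x, y, z)


# Rotate the x axis a quarter turn about z.
print(rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2))
```

The result is the y axis up to floating point rounding. Note this uses only multiplies and adds plus one sine and cosine, so it does not by itself explain a dedicated "complex" block.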

RantyDave 2 years ago |

So ... the registers are dynamically allocated from a chunk of cache? Does this mean there, effectively, are no registers? Does this cache have one clock latency?

reroute22 2 years ago | |

I doubt anyone will be able to answer questions this fine grained, not now (if the implementation is architecturally exposed - leaks into the ISA - and Asahi Linux group figures some of it out), or possibly not ever (if it's architecturally transparent and thus entirely micro-architectural).

> Does this mean there, effectively, are no registers?

I can only point out just for context that if by any chance you're asking whether the registers are implemented as actual hardware design "registers" - individually routed and individually accessible small strings of flip-flops or D-latches - then the history of the question is actually "it never was registers in the first place" - architectural (ISA) registers in GPUs are implemented by a chunk of addressable ported SRAM, with an address bus, data bus, and limited number of accesses at the same time and limited b/w [1].

[1] see the diagram at https://www.renesas.com/us/en/products/memory-logic/multi-po...

RantyDave 2 years ago | | |

Oh! Well, that explains that then. Wild!

josu 2 years ago |

Does the narration sound like AI to anyone else?

alphanullmeric 2 years ago |

It’s amazing how bad the competition is. The A17 Pro has 2 performance cores and 4 efficiency cores. The Google G3 has 9 cores of 3 different types, the fastest being slower than Apple’s performance cores, the most efficient being less efficient than Apple’s efficiency cores. And it’s a phone, so you don’t take advantage of the extra parallelism. You just get the worst of both worlds. No wonder these Android phones have 50% more battery and 50% less battery life. Is it that hard to just copy the winning formula?

diimdeep 2 years ago |

Really awful narration with overuse of unnecessary pitch glides

bayindirh 2 years ago | |

I don't share your sentiment about the guy, but that's OK.

Apple has a style, and he talks with that style very well, preventing the listener from drifting off.

flohofwoe 2 years ago | |

I'm German and I don't hear anything wrong about the voice. He's probably not a native speaker, give the guy some slack.

ta8645 2 years ago | |

I had a similar reaction at first. But, for whatever reason, when playing the video at 1.5x it felt completely fine.

goosinmouse 2 years ago | |

It is pretty funny how amateur a multi-trillion-dollar company can be; it's so distracting and sounds like an '80s instructional VHS.

riscy 2 years ago | | |

the narrator is an actual engineer, not a television personality

dylan604 2 years ago | | |

you could never hear the audio that cleanly on a VHS even with HiFi tracks.