GPU advancements in M3 and A17 Pro [video] (developer.apple.com)
"Discover new Metal profiling tools for M3 and A17 Pro"
https://developer.apple.com/videos/play/tech-talks/111374/
"Learn performance best practices for Metal shaders"
https://developer.apple.com/videos/play/tech-talks/111373/
"Bring your high-end game to iPhone 15 Pro"
These are two excellent posts that go deep on this:
The Shader Permutation Problem - Part 1: How Did We Get Here?
The Shader Permutation Problem - Part 2: How Do We Fix It?
In particular, the second post has the line:
> We probably should not expect any magic workarounds for static register allocation: if a callable shader requires many registers, we can likely expect for the occupancy of the entire batch to suffer. It’s possible that GPUs could diverge from this model in the future, but that could come with all kinds of potential pitfalls (it’s not like they’re going to start spilling to a stack when executing thousands of pixel shader waves).
... And some kind of 'magic workaround for static register allocation' is pretty much what has been done.
https://therealmjp.github.io/posts/shader-permutations-part1...
https://therealmjp.github.io/posts/shader-permutations-part2...
So how does one translate to an equivalent in "CUDA cores" type terminology?
For example, a GTX 1080 GPU has 20 streaming multiprocessors (SMs), each containing 128 cores, each of which supports 16 threads.
Meanwhile Apple describes the M1 GPU as having 8 cores, where “each core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs). In total, the M1 GPU contains up to 128 Execution units or 1024 ALUs, which Apple says can execute up to 24,576 threads simultaneously and which have a maximum floating point (FP32) performance of 2.6 TFLOPs.”
So one option to get a single number for a rough comparison is to count threads. The GTX 1080 supports 40,960 threads, while the M1 supports 24,576 threads.
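A quick way to check that arithmetic, as a minimal sketch using only the figures quoted in this thread (the function and variable names are made up for illustration):

```python
# Resident-thread counts, using only the numbers quoted in the thread.

def resident_threads(units: int, lanes_per_unit: int, threads_per_lane: int) -> int:
    """Total threads that can be resident on the GPU at once."""
    return units * lanes_per_unit * threads_per_lane

# GTX 1080: 20 SMs x 128 "cores" per SM x 16 threads per core
gtx_1080_threads = resident_threads(20, 128, 16)

# M1: Apple quotes the total directly; dividing by the 1024 ALUs
# gives the number of resident threads per ALU.
m1_threads = 24_576
m1_threads_per_alu = m1_threads // 1024

print(gtx_1080_threads)    # 40960
print(m1_threads_per_alu)  # 24
```

Note this counts resident threads, not threads executing in the same cycle, so it says more about latency-hiding capacity than about raw throughput.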
There’s obviously a lot more to a GPU — for starters, varying clock speeds, ALUs can have different capabilities, memory bandwidth, etc. But at least counting threads gives a better idea of the processing bandwidth than talking about cores.
The fact that each SM can support 1024 threads (that's the maximum blocksize of CUDA on that card) doesn't do much for the theoretical flops. Only a fraction of those threads can be active at a time. The others are idling or waiting on their memory requests. This hides a lot of the i/o latency.
Apple has always been really good at parallelism, which is why they get so much performance from less power consumption.
The formal TFLOPS comparison as a result would be most sensible between pre-M3 designs, AMD 6000 series (RDNA 2), and Nvidia's 2000 series (Turing). After that it gets really murky with AMD's "TFLOPS" looking nearly 2x more than they are actually worth by the standards of prior architectures, followed by Nvidia (some coefficient lower than 2, but still high), followed by M3 which from the looks of it is basically 1.0x on this scale, so long as we're talking FP32 TFLOPS specifically as those are formally defined.
You can see this effect the easiest by comparing perf & TFLOPS of AMD 6000 series and Nvidia 3000 series - they have released nearly at the same time, but AMD 6000 is one gen before the "near-fake-doubling", while Nvidia's 3000 series is the first gen with the "close-to-fake-doubling": with a little effort you'll find GPUs between these two that perform very similarly (and have very similar DRAM bandwidth), but Ampere's counterpart has almost 2x the FP32 TFLOPS.
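For reference, the formal FP32 TFLOPS figure is just ALU count times operations per ALU per cycle times clock. Here is a rough sketch, assuming one FMA per ALU per cycle counted as two floating point operations (the usual convention). It uses the M1 figures quoted earlier in the thread (1024 ALUs, 2.6 TFLOPS) to back out the implied clock; the function names are hypothetical.

```python
def fp32_tflops(alus: int, clock_ghz: float, flops_per_alu_per_cycle: int = 2) -> float:
    """Peak FP32 TFLOPS, assuming every ALU retires one FMA (2 FLOPs) per cycle."""
    return alus * flops_per_alu_per_cycle * clock_ghz / 1000.0

def implied_clock_ghz(alus: int, tflops: float, flops_per_alu_per_cycle: int = 2) -> float:
    """Clock that a quoted peak TFLOPS figure implies under the same assumption."""
    return tflops * 1000.0 / (alus * flops_per_alu_per_cycle)

# M1 figures as quoted in the thread: 1024 ALUs, 2.6 TFLOPS
m1_clock = implied_clock_ghz(1024, 2.6)
print(round(m1_clock, 2))  # 1.27
```

The "near-fake-doubling" described above amounts to doubling `alus` in this formula without the extra ALUs being usable every cycle, so the peak number grows much faster than delivered performance.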
I don’t really think we can, even if we knew exactly what is in an M3 GPU core, which we don’t. Both architectures are very different, and different again from AMD GPUs. We have to count TFLOPS.
"The new Max"? He clearly meant "the new Macs".
Kinda weird that Apple can't properly transcribe its own content.
I suspect they have the videos transcribed externally, and don’t check the transcription (or only do so in a cursory manner).
For YT vids, especially shorts, it's because churning out shorts/reels/tiktoks of clips from longer form videos (and/or with the split screen gameplay of some mobile game/minecraft platforming run) is now a common tactic for trying to gain tons of views on your account for monetisation later.
[1] https://developer.apple.com/videos/play/tech-talks/111374?ti...
Not going to happen, what's more likely is a Proton-like layer above macOS APIs to simplify porting games over. Also see "Game Porting Toolkit" here: https://developer.apple.com/games/
https://forums.macrumors.com/threads/m1-m2-flickering-ghosti...
https://www.benq.com/en-us/knowledge-center/knowledge/how-to...
https://www.howtogeek.com/805459/mac-flickering-external-scr...
In fact, even in discrete GPUs, the display scanout engine is generally a nearly completely independent block relative to the rest of the GPU.
[1] https://gist.github.com/GetVladimir/c89a26df1806001543bef4c8...
One works perfectly fine and is automatically RGB. The other flickers, and when changed to RGB mode it turns lime green.
I wonder why it is not a problem with Intel Macs, and whether M3 fixes those bugs?
Maybe it is Apple's feature to sell more of their own monitors.
They make sure other high-end brands do not work with macOS.
It does not make sense that this kind of bug has been active for 3 years.
edit: Found the article: https://www.ea.com/frostbite/news/circular-separable-convolu...
> Does this mean there, effectively, are no registers?
I can only point out just for context that if by any chance you're asking whether the registers are implemented as actual hardware design "registers" - individually routed and individually accessible small strings of flip-flops or D-latches - then the history of the question is actually "it never was registers in the first place" - architectural (ISA) registers in GPUs are implemented by a chunk of addressable ported SRAM, with an address bus, data bus, and limited number of accesses at the same time and limited b/w [1].
[1] see the diagram at https://www.renesas.com/us/en/products/memory-logic/multi-po...
Apple has a style, and he talks with that style very well, preventing the listener from drifting off.
I watched the full video and thought it was excellent. I wish other CPU/GPU manufacturers made technical overview videos like this. I've never programmed graphics targeting metal before, but I feel much more inclined after watching this so I guess it was good advertising.
Have you dived into them?
There are advanced (for me) sessions like:
https://developer.apple.com/videos/play/wwdc2023/10127/
https://developer.apple.com/videos/play/wwdc2023/10042/
Although it's true that they won't discuss at hardware level.
It will still be years before it is practical for Linux developers to target these features. Eventually, the rate of change in GPU design will slow and Linux will catch up once and for all. But it's hard to not drool over the hardware that proprietary OSs get to use today.
There is some hyperbole interjected about how incredible the performance is but that's only in between the useful data. (I did chuckle at the… enthusiasm of the speaker though)
This isn't a technical document for GPU designers. Apple doesn't really need or want you to understand exactly how the implementation works because that's basically trade secret for them. This is aimed at letting app / game developers know how they should optimize for the new GPUs, since previously Apple just made some ambiguous remarks about some of these new technologies ("Dynamic Caching") without explaining what they meant.
But yes, I do like how the Asahi folks tend to end up documenting a lot of how this hardware works, but they also only have public information like this to start from, so this is still useful info for them to have.
Somewhere in the ballpark of 5:30-6:00 or so it describes the prior hardware design of Apple's shader core, and starting at 7:00 it goes into the hardware design of the new M3/A17 Pro shader core. It's actually surprisingly detailed, e.g. Nvidia's whitepapers provide less detail on the actual organization of their SMs.
Anyone else literally cannot compete, they don't have billions in pocket change they don't know how to spend otherwise, so they'll have to wait until the exclusivity agreement expires.
your parent comment's example is literally Google, world-class experts at burning money on developers producing a million dead-end products and abandoning them a year later.
if Google would get some sensible leadership, focus on a few core products, and stick with them for a decade, they'd have just as much money to spend. But "focus" and "Google" seem to have become opposites.
My point: the 'winning formula' of Apple is laser-sharp focus: have a few products, do them as well as anyone else or better, and only introduce a new product if it is mature-ish and very profitable. (We'll see how the vision headset fits in here)
So it's something they took advantage of after they grew (well, which company at their scale wouldn't ask for the best wholesale deals?), but not what made them big in the first place.
"Is it that hard to just copy the winning formula?"
Yes it is, thanks to IP law. And back in the day Steve Jobs already wanted thermonuclear war on Samsung, because he felt their flagship at the time was too close to the iPhone.
It's still somewhat interesting because threads are a low-level programming primitive. If you can come up with work for 40k simultaneous threads, you can use the GPU effectively. For some tasks this parallelization is obvious (an HD video frame has 2 million pixels and shading them independently is trivial), and of course often it's anything but.
I see it with every major product announcement; the worst are usually Apple threads, but it’s not confined to them.
Btw. it appears you are shadow banned. You might want to check some of your comments and then contact dang.
They also aimed at markets that are ripe for disruption, because of weak competition: the MP3 player market before the iPod, the PDA-with-a-SIM-card market before the iPhone, etc. All could be disrupted by just delivering a reasonable (but not even best-in-class, specs-wise) product with better UX (not hard, in the cases mentioned) and massive marketing. You can't do that in a heavily competitive market that's already full of these products. VR headsets are probably closer to the "ripe for disruption" end of the spectrum, and I think the Vision will probably do well. But I doubt the "Apple Car" plans that have been floating around for 10 years now will ever lead to anything.
What heavily competitive market, the mobile phone duopoly?
Google has half the market, they are not the poor incumbent that doesn't have enough money to be disruptive.
The iPod's only notable hardware that wasn't just a random off the shelf part was the click wheel, the chips were all off-the-shelf (until old iPhone chips counted as that), and iPhones didn't get custom chips until the 4.
So I guess the other part of the winning formula is "use market dominance in one sector to subsidize expansion into the next". I guess that's indeed one area where Google could reasonably try to be less inept, but I think all the institutional inertia makes that impossible by now. They'll go the DEC route of just drowning in their own internal problems until someone buys them up.
The iPod/iPhone yes, but "in the first place" the App Store was insignificant (the revenues in 2010 were < 2 billion dollars worldwide, so Apple's take was less than $600 million).
For comparison the iPod had that profit already in 2004, and around 3 billion in 2010 (when the iPhone had already started replacing it).
So, the App Store was hardly ludicrous revenue for its first 3-4 years, in fact less than 10% of Apple's revenue. The iTunes store even less so.
It's the iPod and then iPhone that made Apple's dominance. The big store profit came later (and the iTunes/Music profit never was that big).
They are low utilization, but apparently still worth it because process node changes have made more ALUs take relatively little area. So doubling the ALU count, even with low utilization is still apparently an overall benefit (ie, there wasn't something better to spend that die space on).
Sorry, I was around for DirectX 1.0 back when GPUs were called "graphics accelerators", and don't see how that's possible.
Do you have a source for that, or is there some implicit caveat like counting some emulation layer or something? Even then...
Across platforms, there's no chance in hell that more games run on Metal than DirectX, not even when counting iOS shovelware.
24,576 threads (or however many, I didn't validate the number and it depends on the occupancy, which depends on thread resource usage, like registers => depends on the shader program code) is how many threads can be executed concurrently (as opposed to in parallel), as in, how many of them can simultaneously reside on the GPU. A subset of those at any time are actually executed in parallel, the rest are idle.
You can think of this situation as follows using an analogy with a CPU and an OS:
1. 128 * the-number-of-cores is the number of CPU cores(*1)
2. 24,576 threads is the number of threads in the system that the OS is switching between
Major differences with the GPU:
3. On a CPU, a context switch (getting a thread off the core, waking up a different thread, restoring the context, and proceeding) takes about 2,000 cycles. On a GPU, the equivalent kind of thread switching _from the analogy_ takes ~1-10 cycles depending on the exact GPU design and various other details.
4. In CPU/OS world the context switching and scheduling on the OS side is done mostly in software, as the OS is indeed software. In GPU's case the scheduler and all the switching is implemented as fixed function hardware finely permeating the GPU design.
5. In CPU/OS world those 2,000 cycles per context switch are so much larger than a roundtrip to DRAM while executing a load instruction that happened to miss in all caches - which is about 400-800 cycles or so depending on the design - that the OS never switches threads to hide latencies of loads, it's pointless. As far as performance is concerned (as opposed to maintaining the illusion of parallel execution of all programs on the computer), the thread switching is used to hide the latency of IO - non-volatile storage access, network access, user input, etc. (which takes millions of cycles or more - so it makes sense).
In the GPU world the switching is so fast, that the hardware scheduler absolutely does switch from thread to thread to hide latencies of loads (even the ones hitting in half of the caches, if that happens), in fact, hiding these latencies and thus keeping ALUs fed is the whole point of this basic design of pretty much all programmable GPUs that there ever were.
6. In real world CPU/OS, the threads that aren't running at the time reside (their local variables, etc) in the memory hierarchy, technically, some of it ends up in caches, but ultimately, the bulk of it on a loaded system is in system DRAM. On a GPU, or I suppose by now we have to say, on a traditional GPU, these resident threads (their local variables, etc) reside in on-chip SRAM that is a part of the GPU cores (not even in a chunk on a side, but close to execution units in many small chunks, one per core). While the amount of DRAM (CPU/OS) is a) huge, gigabytes, and b) easily configurable, the amount of thread state the GPU scheduler is shuffling around is measured typically in hundreds of KBs per GPU Core (so on the order of about "a few MBs" per GPU), and the equally sized SRAM storing this state is completely hardwired in the silicon design of the GPU and not configurable at all.
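Points 3 and 5 boil down to one comparison: switching away from a stalled thread only pays off when the stall is longer than the switch. This is a minimal sketch using the rough cycle counts above. The constants are ballpark values picked from the ranges in this comment, not measurements, and the model ignores the cost of switching back.

```python
def cycles_recovered(stall_cycles: int, switch_cost_cycles: int) -> int:
    """Useful cycles gained by switching to another thread during a stall.

    A negative result means the switch costs more than simply waiting.
    """
    return stall_cycles - switch_cost_cycles

# Ballpark figures taken from the ranges quoted above (assumptions, not measurements)
DRAM_MISS = 600        # midpoint of the 400-800 cycle range
CPU_SWITCH = 2_000     # OS context switch
GPU_SWITCH = 10        # upper end of the ~1-10 cycle range
IO_WAIT = 1_000_000    # "millions of cycles or more"

print(cycles_recovered(DRAM_MISS, CPU_SWITCH))  # -1400: pointless on a CPU
print(cycles_recovered(DRAM_MISS, GPU_SWITCH))  # 590: worth it on a GPU
print(cycles_recovered(IO_WAIT, CPU_SWITCH))    # 998000: worth it for IO
```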
Hope that helps!
footnotes (*1) a better analogy would be not "number of CPU cores", but "number-of-CPU-cores * SMT(HT) * number-of-lanes-in-AVX-registers", where number-of-lanes-in-AVX-registers is basically "AVX-register-width / 32" for FP32 processing which (the latter) yields about ~8 give or take 2x depending on the processor model. Whether to include SMT(HT) multiplier (2) in this analogy is also murky, there is an argument to be made for yes, and an argument to be made for no, and depends on the exact GPU design in question.
128 = 4 (physical cores) * 2 (hyperthreading) * 8 (AVX2 f32 lanes) * 2 (floating point ports per core)
Also, your "128 cuda cores" of Skylake variety run at higher frequencies and work off of much bigger caches, so they are faster (in serial manner)...
...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...
...until they are faster again when the shader program uses a lot of registers and GPU occupancy drops to the floor and latency hiding stops hiding that well.
But core counts - yes, more or less.
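To put rough numbers on the "occupancy drops to the floor" case, here is a sketch of how register usage caps the resident thread count on a traditional GPU with a static register file. The 3,072-thread limit per core is the M1 figure from this thread (24,576 threads / 8 cores). The 256 KiB register file size and 4-byte registers are made-up assumptions for illustration, and real hardware allocates per SIMD group rather than per thread.

```python
def resident_threads_per_core(
    regs_per_thread: int,
    regfile_bytes: int = 256 * 1024,  # hypothetical register file size per core
    bytes_per_reg: int = 4,           # assumes 32-bit registers
    hw_max_threads: int = 3072,       # 24,576 threads / 8 cores (M1 figures above)
) -> int:
    """Threads that fit on one core: the lower of the hardware cap and the register budget."""
    by_registers = regfile_bytes // (regs_per_thread * bytes_per_reg)
    return min(hw_max_threads, by_registers)

for regs in (16, 32, 64, 128):
    threads = resident_threads_per_core(regs)
    print(f"{regs:>3} regs/thread -> {threads:>4} threads ({threads / 3072:.0%} occupancy)")
```

Under these assumptions a shader using 128 registers per thread leaves a sixth of the threads resident, so there are far fewer other threads to switch to while a load is outstanding.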
If it's the latter, it's still correct to say that FP32 is king in mobile graphics.
Let’s guess: maybe they have different drivers? Would be far from surprising, given that they’re running on different processors.
They advised me to buy Apple Display.
How very odd.
Imagination Technologies is a near 40 year old British silicon IP company that has been doing GPUs for quite some time, just not ones supporting DX up until now, and it has nothing to do with MS (in terms of ownership / rights / etc).
[1] https://learn.microsoft.com/en-us/windows/win32/direct3d11/o...
Similarly, Microsoft would need to release the non-GPU-specific bits for macOS to fit the same model.
Of course the idea of Apple switching from Metal to D3D is rubbish, but a Proton-like solution for macOS totally makes sense (maybe the "Game Porting Toolkit" will be exactly that eventually).
Perf of a GPU can be limited by any one of the thousand little things within the micro-architectural organization of the GPU in question, any on-chip path can become the bottleneck:
1. DRAM bandwidth
2. ALU counts
3. Occupancy
4. Instruction issue port counts
5. Quality of Warp scheduling (the scheduling problem)
6. Operand delivery
7. Any given cache bandwidth
8. Register file bandwidth (SRAM port counts)
9. Head of the line blocking in one of the many queues / paths in the design whatever that path is responsible for:
- sending memory request from the instruction processing pipelines to the memory hierarchy
- or sending the reply with the data payload back,
- or doing the same but with the texture filtering block (rather than memory H),
- or the path that parses GPU commands from command buffers created by the driver,
- or the path that subsequently processes those already decoded commands and performs on-chip resource allocation, warp creation / tear down, all of which need to be able to spawn the work further down fast enough to keep the rest of the design fed;
and so on and so on and so on.
By the time a high quality design is fully finished, matured, and successful enough on the market to show up on everyone's radar outside of the hardware design space, due to the commonly occurring ratios of costs of solutions for these various problems above, it usually ends up being 1, 2, or 3, but that's experimental data + statistics + survivorship bias, there is no "definition" that that's the case.
Further, what's "commonly occurring" is changing over time as designs drift into different operational areas in the phase space of operating modes as they pick out low-hanging fruits, science and experience behind the micro-architecture grows, common workloads change in nature, and new process nodes with new properties become the norm. Doubling up of F32 ALUs in Ampere is a good example of that, that was done in a way that changed the typical ratios substantially. And now M3 threw a giant wrench into incumbent statistics of relationships between (3) and the rest of the limiters, as well as between (3) and what's actionable for a GPU program developer to do to mend it.
You can be low on DRAM bandwidth util and ALU util at the same time. How would that be if there were no other limiters?
Generally, a component X of a computer system needs to be a limiter Y% of the time where Y equals the portion of the total cost of the system X is responsible for.
The principle is the easiest to apply in a "calculus of variations" manner: if doubling the key performance metric of X results in an increase of the cost of the entire system as a whole by 5%, but how often X is the limiter drops from 10% of the time to 5% of the time, doing the doubling would bring the design quite close to proper balance wrt. X.
Things that are cheap to beef up are relatively rarely a limiter in well-designed systems as a result. Things that are expensive to beef up are often the limiter. What is and isn't expensive to do depends heavily on where in the design space the current design is at and where the technology is at, all of which is changing over time.
FP32 was cheap to double up in Ampere, so they doubled it up, even though that provided only a relatively small performance improvement. But now as a result, FP32 is very rarely a limiter (in Ampere and Ada). That doesn't automatically mean that these designs are "gimped" in DRAM bandwidth or anything of the kind. Rather, the whole perception that a good GPU design just gotta be ALU limited all the time is nothing but a mistaken perception, just like "it's either ALU limited or DRAM bandwidth limited by definition" also is just untrue. See "occupancy limited" for a prime example.
Other monitors work without flicker on Intel Macs, and suddenly M1 Macs have a flicker issue...
Solution: "Buy Apple Display"
In kernel space you as the hardware manufacturer implement a display miniport driver, and heavily lean on dxgkrnl.sys
https://learn.microsoft.com/en-us/windows-hardware/drivers/d...
And implement interfaces like this one for D3D12 https://learn.microsoft.com/en-us/windows-hardware/drivers/d...
There's quite a bit though and it's kind of spread across several sections on MSDN.
> ...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...
Is the GPU latency hiding mechanism equivalent to SMT/Hyperthreading, but with more threads per physical core? Or is there more machinery?
Also, how akin are GPU "streaming multiprocessors"/cores to CPU ones at the microarchitectural level? Are they out-of-order? Do they do register renaming?
A "Core" (Apple's term) / "Compute Unit" (AMD) / "Streaming Multiprocessor" (Nvidia) / "Core" (CPU world). This is the basic unit that gets replicated to build smaller/larger GPUs/CPUs
* Each "Core/CU/SM" supports 32-64 waves/simdgroups/warps (AMD/Apple/Nvidia terminology), or typically 2 threads (CPU terminology for hyperthreading). I.e., this is the unit that has a program counter, and is used to find other work to do when one thread is unavailable. (This got blurred on later Nvidia parts with Independent Thread Scheduling.)
* The instruction set typically has a 'vector width'. 4 for SSE/NEON, 8 for AVX, or typically 32 or 64 for GPUs (but can range from 4 to 128)
* Each Core/CU/SM can execute N vector instructions per cycle (2-4 is common in both CPUs and GPUs). For example, both Apple and Nvidia GPUs have 32-wide vectors and can execute 4 vectors of FP32 FMA/cycle. So 128 FPUs total, or 256 FLOPs/cycle (counting each FMA as two operations). Each of these FPUs is what Nvidia calls a "Core", which is why their core counts are so high.
In short, the terminology exchange rate is 1 "Apple GPU Core" == 128 "Nvidia GPU Cores", on equivalent GPUs.
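The exchange rate falls straight out of the numbers above. A small sketch follows; the constant names are made up, and the 8-core M1 configuration is the one quoted earlier in the thread.

```python
VECTOR_WIDTH = 32        # lanes per SIMD instruction (Apple and Nvidia, per the comment above)
VECTORS_PER_CYCLE = 4    # FP32 FMA vector instructions issued per core per cycle

fpus_per_core = VECTOR_WIDTH * VECTORS_PER_CYCLE   # what Nvidia would call "cores"
flops_per_cycle_per_core = fpus_per_core * 2       # an FMA counts as 2 floating point ops

M1_GPU_CORES = 8
m1_alus = M1_GPU_CORES * fpus_per_core

print(fpus_per_core)             # 128
print(flops_per_cycle_per_core)  # 256
print(m1_alus)                   # 1024, matching the ALU count Apple quotes for the M1
```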
> how akin are GPU "streaming multiprocessors"/cores to CPU ones at the microarchitectural level?
I'd say, if you want to get a feel for it in a manner directly relevant to recent designs, then reading through [1], [2], subsequent conversation between the two, and documents they reference should scratch that curiosity itch well enough, from the looks of it.
If you want a much more rigorous conversation, I could recommend the GPU portion of one of the lectures from CMU: [3], it's quite great IMO. It may lack a little bit in focus on contemporary design decisions that get actually shipped by tens of millions+ in products today and stray to alternatives a bit. It's the trade-off.
> Are they out-of-order?
Short answer: no.
GPUs may strive to achieve an "out of order" effect by picking out a different warp entirely and making progress there, completely circumventing any register data dependencies and thus any need to track them, achieving a similar end objective in a drastically more area- and power-efficient manner than Tomasulo's algorithm would.
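A toy model of that "pick a different warp" idea, in case it helps. This is a deliberately simplified sketch, not how any real scheduler is implemented: one issue slot per cycle, every warp alternates a fixed burst of ALU work with a fixed-latency load, and the scheduler greedily picks the first ready warp. All parameter values are hypothetical.

```python
def alu_utilization(
    num_warps: int,
    compute_cycles: int = 4,     # ALU cycles each warp runs before issuing a load
    load_latency: int = 100,     # cycles a warp waits for its load to return
    total_cycles: int = 10_000,
) -> float:
    """Fraction of cycles in which the single issue slot had a ready warp to run."""
    ready_at = [0] * num_warps                 # cycle at which each warp can issue again
    work_left = [compute_cycles] * num_warps   # ALU cycles left before the next load
    busy = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:           # first ready warp wins the slot
                busy += 1
                work_left[w] -= 1
                if work_left[w] == 0:          # burst done: issue a load and stall
                    ready_at[w] = cycle + 1 + load_latency
                    work_left[w] = compute_cycles
                break                          # only one warp issues per cycle
    return busy / total_cycles

for n in (1, 13, 26):
    print(n, round(alu_utilization(n), 3))
```

With these numbers a single warp keeps the ALU busy under 5% of the time, while 26 resident warps (4 cycles of work each covering the 100-cycle load) keep it fully busy, with no dependency tracking or renaming involved. This is also why lower occupancy hurts: fewer resident warps means less latency covered.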
> Do they do register renaming?
Short answer: no.
[1] https://forums.macrumors.com/threads/3d-rendering-on-apple-s...
[2] https://forums.macrumors.com/threads/3d-rendering-on-apple-s...
[3] https://www.youtube.com/watch?v=U8K13P6loyk ("Lecture 15. GPUs, VLIW, Execution Models - Carnegie Mellon - Computer Architecture 2015 - Onur Mutlu")
The OP said Apple should switch to DirectX. They can't because it's closed source. The reason given was that devs won't support 2 APIs. They already do. They support DirectX and Vulkan (PS5 version) almost always. And devs already support Metal - with more games than the latest DirectX.
Also many developers support even more than 2, as you are forgetting PlayStation and Switch, both of which have their own APIs: 2 for PlayStation (GNM and GNMX), 3 for Switch (NVN, Vulkan and OpenGL).
>Also many developers support even more than 2, as you are forgetting PlayStation and Switch, both of which have their own APIs: 2 for PlayStation (GNM and GNMX), 3 for Switch (NVN, Vulkan and OpenGL).
I didn't imply that developers only support 2.
Since it is a fact I am expecting a number for them as well.
You naturally also have a number for the ones that are still using the legacy OpenGL ES API, or the MoltenVK wrapper on iDevices.
As in, facts from a certified market analysis company, showing the actual number of players per platform.