Accelerated PyTorch Training on M1 Mac (pytorch.org)
1.) Apple Silicon currently can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.
Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency
2.) Apple Silicon probably will compete directly with Nvidia GPUs in the near future in terms of raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.
3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
1) Nope. For neural network training not the case: https://tlkh.dev/benchmarking-the-apple-m1-max
And that's with the 3090 set at a very high 400W power limit; it can get far more efficient when clocked lower.
(which is normal, because no dedicated matrix math accelerators on the GPU notably)
2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)
3) Indeed, if you need a giant pool of VRAM above everything else at a relatively low price tag, Apple is indeed a quite enticing option. If you can stand Metal for your use case of course.
For raw compute like you need for ML training, the M1's efficiency doesn't matter. Under the hood, at the hardware level, you have a direct mapping of power consumption to compute circuit activation that you really can't get around.
The general efficiency of the M1 is due to its architecture and how it fits together with normal consumer use: less stuff on the instruction decode, more efficient reordering, less energy wasted moving data around thanks to the shared memory architecture, etc.
The comparison of efficiency between Apple and Nvidia here is a bit misleading because one compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.
As to how Apple achieves such high efficiency, nobody knows. The fact that they are on the 5nm node might help, but there must be something special about the ALU design as well. My speculation is that they are wider and much simpler than in other GPUs, which directly translates to efficiency wins.
Apple is simply behind in the GPU space.
> At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
The reason why it's cheaper is that its memory is at a fraction (around 20-35%) of the memory bandwidth of a 128GB equivalent GPU set up, which also has to be split with the CPU. This is an unavoidable bottleneck of shared memory systems, and for a great many applications this is a terminal performance bottleneck.
That's the reason you don't have a GPU with 128GB of normal DDR5. It would just be quite limited. Perhaps for some cases it can be useful.
Here's some info about M1 memory bandwidth: https://www.anandtech.com/show/17024/apple-m1-max-performanc...
It seems like this is ideal as an accelerator for already trained models; one can imagine Photoshop utilizing it for deep-learning based infill-painting.
I was doing training on battery with a laptop that had a 1080; I have trained models on an airplane while totally unplugged and still had enough power to web-surf afterwards.
- Apple undoubtedly owns the densest nodes, and will fight TSMC tooth-and-nail over first dibs on whatever silicon they have coming next.
- Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.
- Nvidia has insane engineers. Despite the fact they're using silicon that's more than twice as large by-area when compared to Apple, they're still doubling their numbers across the board. And that's their last-gen tech too, the comparison once they're on 5nm later this summer is going to be insane.
I expect things to be very heated by the end of this year, with new Nvidia, Intel and potentially new Apple GPUs.
Interesting observation. I wonder what the biggest-memory iGPU configuration you can get on the x86 side is?
Better would be mobilenets or efficientNets or NFNets or vision transformers or almost anything that's come out in the 8 years since VGG was published (great work it was at the time!).
pip3 install --pre torch==1.12.0.dev20220518 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
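For anyone trying the nightly above, here is a minimal sketch of picking the Metal (MPS) backend when it is present and falling back otherwise. `pick_device` is a hypothetical helper name, and the import guard is only there so the snippet also runs on a machine without PyTorch; `torch.backends.mps.is_available()` is the availability check that ships with the 1.12 builds.

```python
def pick_device():
    """Return "mps", "cuda" or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed we simply report the CPU.
        return "cpu"
    # Builds older than 1.12 have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(device)
```

A model or tensor would then be moved with `.to(device)`, the same way as with CUDA.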
In both cases the unified memory machines outperformed much larger machines in specific use cases.
- It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs with server chips could be configured with an order of magnitude more RAM)
- ECC is still off-the-table on M1 apparently
- Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
- Most of the so-called "specific workloads" that you're outlining aren't very general applications. So far I've only seen ARM outrun x86 in some low-precision physics demos, which is... fine, I guess? I still don't foresee meteorologists dropping their Intel rigs to buy a Mac Studio anytime soon.
I'm curious how the benchmarks change with this recent release!
$ conda install pytorch torchvision torchaudio -c pytorch-nightly
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- torchaudio
And the pip install variant installs an old version of torchaudio that is broken:
OSError: dlopen(/opt/homebrew/Caskroom/miniforge/base/envs/test123/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at14RecordFunctionC1ENS_11RecordScopeEb
pip3 install pytorch
worked for me. I think it's something with your brew installation.
fragmede@samairmac:~$ python
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:24:02)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__file__
'/Users/fragmede/projects/miniforge3/lib/python3.9/site-packages/torch/__init__.py'
>>>
The neural engine is small and inference only. It's also only exposed through a far higher-level interface, CoreML.
Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but with the price of a Mac, not sure that makes a lot of sense either in most scenarios compared to paying for a big dGPU.
That's because the M1 has a dedicated matrix math accelerator called AMX [1]. I've used it with both Swift and pure C.
https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...
Why is it inference only? At least the operations are the same...just a bunch of linear algebra
It is not really comparable on a step per second level but the power consumption and now GPU memory will make it pretty enticing.
(pytorch_env) ~/dev/ai/ python -c "import torch"
Note that for ch<O(512) (varies by GPU & hw) you tend to be memory-transfer-speed limited, not compute limited.
So unfortunately depthwise convolutions end up having terrible performance.
Note that pointwise 1x1 convolutions are a special case of group convolutions and actually I think they might be specially optimized in PyTorch (I’d have to run some benchmarks to test it though).
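A rough back-of-the-envelope sketch of why channel count decides whether a convolution is memory-bound or compute-bound. The model is deliberately crude and its assumptions are mine: every tensor is read or written exactly once, caches are ignored, the 56x56 feature map is made up, and the hardware numbers (30 TFLOPS FP32, 400 GB/s) are just the 3080-class figures quoted elsewhere in this thread.

```python
def pointwise_conv_intensity(channels, pixels, bytes_per_el=4):
    """FLOPs per byte for a 1x1 conv with `channels` inputs and outputs."""
    flops = 2 * pixels * channels * channels              # one multiply-add per weight per pixel
    traffic = bytes_per_el * (2 * pixels * channels       # read input + write output
                              + channels * channels)      # read weights
    return flops / traffic


def depthwise_conv_intensity(channels, pixels, k=3, bytes_per_el=4):
    """FLOPs per byte for a k x k depthwise conv (one filter per channel)."""
    flops = 2 * pixels * channels * k * k
    traffic = bytes_per_el * (2 * pixels * channels + channels * k * k)
    return flops / traffic


def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Intensity above which a kernel stops being bandwidth-limited."""
    return peak_flops / bandwidth_bytes_per_s


ridge = ridge_point(30e12, 400e9)   # assumed hardware numbers, see above
pixels = 56 * 56                    # hypothetical feature map size

for ch in (64, 256, 1024):
    i = pointwise_conv_intensity(ch, pixels)
    print(ch, round(i, 1), "compute-bound" if i > ridge else "memory-bound")

print("depthwise 3x3:", round(depthwise_conv_intensity(512, pixels), 2))
```

With these assumptions the 1x1 conv intensity grows roughly as channels/4, so it only crosses the ridge at a few hundred channels, while the depthwise conv stays at a couple of FLOPs per byte no matter how many channels it has.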
Since there are no hard benchmarks against other GPUs, here's a Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks like it's about 2x slower--the RTX laptop absolutely rips for gaming, I love it.
[1] https://browser.geekbench.com/v5/compute/compare/4140651?bas...
What do shaders have to do with it? Deep learning is a mature field now, it shouldn't need to borrow compute architecture from the gaming/entertainment field. Anyone else find this disconcerting?
Why is that concerning to you?
People new to CG are likely to intuit “shaders” as something related to, well, shading, but vertex shaders et al have nothing to do with the color of a pixel or a polygon.
And there are "GPUs" today that can't do graphics at all (AMD MI100/MI200 generations) or can only do it in a restricted way (Hopper GH100, which has the fixed-function pipeline on only two TPCs, for compatibility, so graphics runs very slowly).
Concessions towards compute: a C++ programming language for device code (totally unlike what's done for most graphics APIs!)
Concessions towards graphics: no single-source programming model at all for example...
Why not? It's still good for simple classification tasks. We use it as an encoder for a segmentation model in some cases. Most ResNet variants are much heavier.
https://www.kaggle.com/code/jhoward/which-image-models-are-b...
Those slow and inaccurate models at the bottom of the graph are the VGG models. A resnet34 is faster and more accurate than any VGG model. And there are better options now -- for example resnet34d is as fast as resnet34, and more accurate. And then convnext is dramatically better still.
https://github.com/jcjohnson/cnn-benchmarks#:~:text=ResNet%2....
Probably because it makes the hardware look good.
It makes me feel like I'm missing something! Is it still used as a backbone in the same way that legacy code is everywhere, or is it something else entirely?
It's an unfortunate set of terminology due to the way this space evolved from graphics programming - shader cores used to do fixed-function shading! But then people wanted them to be able to run arbitrary shaders and not just fixed-function. And then hey, look at this neat processor, let's run a compute program on it. At first that was "compute shaders" running across graphics APIs, then came CUDA, and later OpenCL. But it is still running on the part of the hardware that provides shading to the graphics pipeline.
Similarly, texture memory actually used to be used for textures, now it is a general-purpose binding that coalesces any type of memory access that has 1D/2D/3D locality.
You kinda just get used to it. Lots of niches have their own lingo that takes some learning. Mathematics is incomprehensible without it, really.
Inference also prefers different IO patterns, because you don't need to keep the activations for every layer ready for backpropagation.
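To make the activation point concrete, a toy sketch with scalar "layers" (a weight multiply followed by ReLU). Everything here is illustrative; `forward` and its flag are made-up names, not any framework's API. The only thing it shows is that the training path has to hold on to every layer's input, while the inference path can throw each one away as soon as the next is computed.

```python
def forward(x, weights, keep_activations):
    """Run x through a chain of scalar linear+ReLU layers."""
    saved = []
    for w in weights:
        if keep_activations:
            # Backprop needs each layer's input to compute that layer's
            # weight gradient, so training has to store all of them.
            saved.append(x)
        x = max(0.0, w * x)
    return x, saved


weights = [0.5, 3.0, 1.5]

y_infer, saved_infer = forward(2.0, weights, keep_activations=False)
y_train, saved_train = forward(2.0, weights, keep_activations=True)

print(y_infer, len(saved_infer))   # same output, nothing kept
print(y_train, saved_train)        # same output, one saved input per layer
```

Stored activations grow with depth (and with batch size in a real network), which is a large part of why training needs so much more memory than inference for the same model.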
Think about changing the model every other year:
- 2015: ResNet trained on an Nvidia K80
- 2017: Inception trained on an Nvidia 1080 Ti
- 2019: Transformer trained on an Nvidia V100
- 2021: GPT-3 trained on a cluster
Now you have your fancy new algorithm X and an Nvidia 4090. How much better is your algorithm compared to the state of the art, and how much have you improved compared to the algorithms of 5 years ago? Now you are in a nightmare: you have to rerun all the past algorithms in order to compare. Or how fast is the new Nvidia card, which no one has yet and for which Nvidia has decided to give numbers based on their own model?
You can fuse grouped convs (depthwise is a special case of grouped convs) into preceding or following layers. Maybe JAX can do this already? No clue if any library offers such an optimization out of the box
Texture units are indeed a part that is useful enough to be exposed to GPGPU compute APIs directly. The "shader" term itself disappeared quite early in those though, as did access to a good part of the FF pipeline including the rasterisers themselves.
Super high level (from section 3):
1. Converting the model to use the float16 data type where possible.
2. Keeping float32 master weights to accumulate per-iteration weight updates.
3. Using loss scaling to preserve small gradient values.
[0] https://docs.nvidia.com/deeplearning/performance/mixed-preci...
Edit: Fixed second link.
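A small numeric sketch of why steps 2 and 3 in the list above exist, using only the standard library. `to_fp16` rounds through IEEE half precision via `struct`; the loss scale of 1024 and the gradient and update sizes are made-up values, and a plain Python float (really double precision) stands in for the float32 master copy.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0   # hypothetical scale factor

# Step 3: a tiny gradient underflows in float16 unless it is scaled first.
true_grad = 1e-8
naive = to_fp16(true_grad)                    # too small for fp16, becomes 0.0
scaled = to_fp16(true_grad * LOSS_SCALE)      # survives as an fp16 subnormal
recovered = scaled / LOSS_SCALE               # unscale in higher precision

# Step 2: small updates vanish if the weights themselves live in float16.
update = 1e-4
w16 = to_fp16(1.0)
w_master = 1.0                                # stand-in for the float32 master weight
for _ in range(10):
    w16 = to_fp16(w16 - update)               # rounds back to 1.0 every time
    w_master = w_master - update              # accumulates as expected

print(naive, recovered)
print(w16, w_master)
```

The float16 weight never moves because the spacing between representable values just below 1.0 is larger than twice the update, so each step rounds straight back to where it started.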
(Also, many inference accelerators use lower precision than you do when training)
There are tricks you can do to use inference to accelerate training, such as one we developed to focus on likely-poorly-performing examples: https://arxiv.org/abs/1910.00762
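A very simplified sketch of the general idea: run a cheap forward-only pass, score each example by its loss, and spend the expensive backward pass only on the hard ones. This is a plain top-k stand-in that I am assuming for illustration, not the selection rule from the linked paper, and `select_hard_examples` plus the toy squared-error loss are made-up names.

```python
def squared_error(example):
    """Toy per-example loss for a (prediction, target) pair."""
    prediction, target = example
    return (prediction - target) ** 2


def select_hard_examples(examples, loss_fn, keep_fraction=0.5):
    """Keep the highest-loss fraction of a batch, hardest first."""
    scored = sorted(
        ((loss_fn(ex), i) for i, ex in enumerate(examples)),
        reverse=True,
    )
    keep = max(1, int(len(examples) * keep_fraction))
    return [examples[i] for _, i in scored[:keep]]


batch = [(0.9, 1.0), (0.1, 1.0), (0.5, 0.0), (0.0, 0.0)]
hard = select_hard_examples(batch, squared_error)
print(hard)   # only these would go through the backward pass
```

Since the scoring pass is inference-only, it can run on whatever hardware is good at inference, at lower precision if need be.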
You can't even poke the ANE hardware directly from a regular process. The interface for accessing the neural engine is not hardened (you can easily crash the machine from it).
So the matter is essentially moot in practice as you'd need your users to run with SIP off...
On the contrary, a 3080 laptop does reach 400GB/s, I'm personally seeing this routinely on AI workloads, so that's part of the explanation for subpar perf here (the other ones being probably matrix math and mixed precision)
A6000 is ~$5k per card. I guess you're referring to something like an A100 on that other spec, which is $10k/card (for 40GB of memory).
I do a fair bit of neural/AI art experimentation, where memory on the execution side is sometimes a limiting factor for me. I'm not training models, I'm not a hardcore researcher--those folks will absolutely be using NVIDIA's high-end stuff or TPU pods.
128GB in a Studio is super compelling if it means I can up-res some of my pieces without needing to use high-memory-but-super-slow CPU cloud VMs, or hope I get lucky with an A100 on Colab (or just pay for a GPU VM).
I have a 128GB/Ultra Studio in my office now. It's a great piece of kit, and a big reason I splurged on it--okay, maybe "excuse"--was that I expect it'll be useful for a lot of my side project workloads over the next couple of years...
However, for lower precisions (which is what deep learning uses), you're much better off with a GPU.
30Tflops for a 3080 for vector FP32, but 119Tflops FP16 dense with FP16 accumulate, 59.5 with FP32 accumulate, and if you exploit sparsity then that can go even higher.
Also related: Apple designs their hardware to do just what they want it to while everyone else is designing for a more general use case. This also costs die area, IP licensing fees etc.
Of course, my perspective here might be extremely naive, I know very little about semiconductor technology, just trying to understand the principal design differences.
For sure but I expect this is different for the apps Apple _wants_ to write. It’s easy to imagine the next version of Logic or whatever doing fine tuning everywhere.
In the first half of 2023, NVIDIA Grace Superchip will ship with a 1TB memory config (930GB usable because of ECC bits) on a 1024-bit wide LPDDR5X-8533 config (same width as M1 Ultra, with LPDDR5-6400).
So it's going to become much less of an issue really soon.
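For reference, the arithmetic behind those two configs: theoretical peak bandwidth is bus width in bytes times the transfer rate. The helper name is made up and these are paper numbers, not measured throughput.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


m1_ultra = peak_bandwidth_gb_s(1024, 6400)   # LPDDR5-6400
grace = peak_bandwidth_gb_s(1024, 8533)      # LPDDR5X-8533

print(round(m1_ultra, 1), round(grace, 1))
```

That works out to roughly 819 GB/s for the M1 Ultra config (Apple quotes 800GB/s) and roughly 1092 GB/s for the Grace config, so same bus width, about a third more bandwidth from the faster memory.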
The main issue would be trying to purchase one of those, which is likely going to be both very rare and orders of magnitude more expensive than a Mac Studio.
The Mac Studio isn't some crazy exotic hardware like datacenter class GPUs, but definitely has some exotic capabilities.
Datacenter class GPUs are expensive yeah, but are quite easy to buy, even in a single unit amount.
example: https://www.dell.com/en-us/work/shop/nvidia-ampere-a100-pcie... for the first random link, but there are other stores selling them for significantly cheaper.
I wonder what their CPU pricing will be though... we'll see I guess.
FP64 is also supported by AMX, making it quite an impressive region of silicon.
Only the workflow is the custom part--the core here is literally the original jcjohnson implementation. Occasionally I look around at recent work in the area, but most seems focused on fast (video-speed) inference or pre-baked style models. I've never seen something that retains artistic flexibility.
My original gut feeling on style transfer was that it would be possible to mold it into a neat tool, but most people bumped into it, ran their profile photo against Starry Night, said "cool" and bounced off. And I get that--parameter tuning can be a sloooow process. When I really explore a series with a particular style I start to feed it custom content images made just for how it's reacting with various inputs.
Here's a piece that just finished a few minutes ago: https://mwegner.com/misc/styled_render-BMrHXWz_2RBaUq8pAYKfL...
That's from a local server in my garage with a K80. At some point I had two K80s in there (so basically four K40s with how they work), but dialed it back for power consumption reasons.
I do have a 3090 in the house, and a decent amount of cloud infra that I sometimes tap. The jcjohnson implementation is so far back that it doesn't even run against modern hardware. At some point I need to sort that out, or figure out how to wrangle a more modern implementation into behaving in the way that I like.
I don't really post these anywhere, although do throw them over the wall on Twitter if anyone is curious to see more. These are a mix of things, although the CLIP/Midjourney/etc stuff is pretty easy to spot: https://twitter.com/mwegner/media
As for 128GB memory on-inference models that a consumer would be interested in, I got nothing, though it certainly seems like it would be fun to mess around with haha
As a concrete example, on a camera you might want to run a facial detector so the camera can automatically adjust its focus when it sees a human face. Or you might want a person detector that can detect the outline of the person in the shot, so that you can blur/change their background in something like a Zoom call. All of these applications are going to work better if you can run your model at, say, 60 Hz instead of 20 Hz. Optimizing hardware to do inference tasks like this as fast as possible with the least possible power usage is pretty different from optimizing for all the things a GPU needs to do, so you might end up with hardware that has both and uses them for different tasks.
When I learned and used gradient descent, you had to analytically determine your own gradients (https://web.archive.org/web/20161028022707/https://genomics....). I went to grad school to learn how to determine my own gradients. Unfortunately, in my realm, loss landscapes have multiple minima, and gradient descent just gets trapped in local minima.
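A toy illustration of both halves of that: a hand-derived gradient, and plain gradient descent ending up in whichever minimum is downhill from the starting point. The function f(x) = x^4 - 3x^2 + x, the learning rate and the step count are all made up for the example.

```python
def f(x):
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Derived by hand: d/dx (x^4 - 3x^2 + x) = 4x^3 - 6x + 1
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_right = descend(2.0)    # ends in the shallower, local minimum
x_left = descend(-2.0)    # ends in the deeper, global minimum

print(round(x_right, 3), round(f(x_right), 3))
print(round(x_left, 3), round(f(x_left), 3))
```

Starting from the right, the iterate settles near x = 1.13 and never sees the deeper minimum near x = -1.30, because nothing in the update rule can carry it over the bump in between.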
Even if you don't care about the part where you program, it's got a lot of beginner/intermediate concepts clearly explained. If you do dive into the programming examples, you get to play around with a few architectures and ideas, and you're left ready to step into the more advanced material knowing what you're doing.
I'm not sure when or why this started.
The model is literally "inferring" something about its inputs: e.g., these pixels denote a hot dog, those don't.
Training is learning the weights (millions or billions of parameters) that control the model's behavior, vs inference is "running" the trained model on user data.
Instead of using gradient descent, we used molecular dynamics (I'm unaware if this has a direct equivalent) to sample the space by moving along various isocontours (constant energy, or constant temp, or usually constant pressure). Even so, you have to do a lot of sampling (in my day it was years of computer time, now it's months) to get a good approximation to the total landscape, and measure transition frequencies between areas of the landscape that correspond to energy barriers (local maxima) that are smaller than the thermal energy available to the system.
It's complicated. Also, DeepMind obviated all my work by proving that sequence data (which is cheap to obtain) can be used to predict very accurate structures with little or no simulation.