Accelerated PyTorch Training on M1 Mac

Accelerated PyTorch Training on M1 Mac (pytorch.org)

443 points by tgymnich 4 years ago | 146 comments

lekevicius 4 years ago |

Curiously, neither PyTorch nor TensorFlow currently uses the M1's Neural Engine. Is it too limited? Too hard to interact with? Not worth the effort?

why_only_15 4 years ago | |

The ANE only has support for calculations with fp16, int16 and int8 all of which are too small to train with (too much instability). A common thing to do is train in fp32 to be able to get the small differences and gradients and then once the model is frozen do inference on fp16 or bf16.
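A quick way to see the kind of instability described above is to round values to half precision by hand. This is only a sketch using Python's standard library (the `struct` module's `e` format is IEEE 754 half precision); the helper name `to_fp16` and the example numbers are made up for illustration and say nothing about how the ANE itself behaves.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

weight = 1.0
update = 1e-4  # a small gradient step, chosen purely for illustration

# fp16 has 10 mantissa bits, so neighbouring values near 1.0 are 2**-10 apart.
# An update smaller than half that spacing is rounded away entirely.
print(to_fp16(weight + update))

# Very small gradients fall below the smallest fp16 subnormal and become zero.
print(to_fp16(1e-8))
```

Accumulating many such updates in fp32 keeps them; accumulating them in fp16 can silently drop them.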

jph00 4 years ago | | |

Using mixed precision training you can do most operations in fp16 and just a few in fp32 where it's needed. This is the norm for NVIDIA GPU training nowadays. For instance using fastai add `.to_fp16()` after your learner call, and that happens automatically.
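One ingredient mixed precision training commonly relies on is loss scaling: multiply the loss (and therefore the gradients) by a large constant before the fp16 part, then divide it back out in higher precision. The sketch below imitates that with plain Python floats standing in for fp32; `to_fp16` and `LOSS_SCALE` are hypothetical names and values, and this is not a reproduction of what fastai or PyTorch execute internally.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 65536.0  # illustrative power-of-two scale factor (an assumption)

true_grad = 1e-8                           # too small for fp16 on its own
naive = to_fp16(true_grad)                 # underflows to 0.0
scaled = to_fp16(true_grad * LOSS_SCALE)   # now inside fp16's normal range
recovered = scaled / LOSS_SCALE            # unscale in higher precision

print("without scaling:", naive)
print("with scaling:   ", recovered)
```

The recovered value is only accurate to about three decimal digits, which is what fp16 offers, but it is no longer zero.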

RicoElectrico 4 years ago | |

Most probably Neural Engine is optimized for inference, not training.

munro 4 years ago | | |

That /sounds/ right, but training still has a forward part, so OP does raise a really great question. And looking at the silicon, the neural engine is almost the size of the GPU. Really need someone educated in this area to chime in :)

sillyinseattle 4 years ago | | |

Question about terminology (no background in AI). In econometrics, estimation is model fitting (training, I guess), and inference refers to hypothesis testing (e.g. t or F tests). What does inference mean here?

alexfromapex 4 years ago |

Since it's tangentially relevant, if you have an M1 Mac I've created some boilerplate for working with the latest Tensorflow with GPU acceleration as well: https://github.com/alexfromapex/tensorexperiments . I'm thinking of adding a branch for PyTorch now.

masklinn 4 years ago | |

Did you compare that to Apple's tf plugin to see what was what?

galoisscobi 4 years ago | |

This is great! Appreciate the note on H5Py troubleshooting as well.

mkaic 4 years ago |

This is really cool for a number of reasons:

1.) Apple Silicon currently can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.

Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency

2.) Apple Silicon will probably compete directly with Nvidia GPUs on raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.

3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
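For a rough sense of what 128GB could hold, here is a back-of-the-envelope sketch. It assumes plain fp32 training with Adam (weights, gradients, and two optimizer moments per parameter), ignores activations entirely, and assumes all of the unified memory is available to the GPU, which may not be true in practice. The function name and the per-parameter byte count are assumptions for illustration.

```python
GIB = 2**30

# Assumed per-parameter cost for fp32 training with Adam:
# 4 bytes weight + 4 bytes gradient + 8 bytes optimizer state (two moments).
BYTES_PER_PARAM = 4 + 4 + 8

def max_trainable_params(memory_gib: float, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Upper bound on parameter count, ignoring activations and OS overhead."""
    return int(memory_gib * GIB) // bytes_per_param

for gib in (24, 128):
    print(f"{gib} GiB -> about {max_trainable_params(gib) / 1e9:.1f}B parameters")
```

Real budgets will be noticeably lower once activations and batch size enter the picture.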

my123 4 years ago | |

> but they're already way ahead on energy efficiency

1) Nope. For neural network training not the case: https://tlkh.dev/benchmarking-the-apple-m1-max

And that's with the 3090 set at a very high 400W power limit, can get far more efficient when clocked lower.

(which is normal, notably because there are no dedicated matrix math accelerators on the GPU)

2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)

3) Indeed, if you need a giant pool of VRAM above everything else at a relatively low price tag, Apple is indeed a quite enticing option. If you can stand Metal for your use case of course.

highfrequency 4 years ago | | |

What do you mean by: "if you can stand Metal for your use case?" What is Metal?

ActorNightly 4 years ago | |

> but they're already way ahead on energy efficiency.

For raw compute like you need for ML training, the M1's efficiency doesn't matter. Under the hood at hardware level, you have a direct mapping of power consumption to compute circuit activation that you really can't get around.

The general efficiency of M1 is due to its architecture and how it fits together with normal consumer use. Less stuff on the instruction decode, more efficient reordering, less energy wasted moving around data due to shared memory architecture, etc.

ribit 4 years ago | | |

And yet somehow Apple's GPU ALUs are more efficient at 3.8 watts per TFLOP. Mind, I am not talking about specialized matrix multiplication units that have a different internal organization and can do things like matrix multiplication much more efficiently, but about basic general-purpose GPU ALUs.

The comparison of efficiency between Apple and Nvidia here is a bit misleading because one compares Apple's general-purpose ALUs to Nvidia’s specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.

As to how Apple achieves such high efficiency, nobody knows. The fact that they are on the 5nm node might help, but there must be something special about the ALU design as well. My speculation is that they are wider and much simpler than in other GPUs, which directly translates to efficiency wins.

sudosysgen 4 years ago | |

Apple Silicon is not ahead at all on energy efficiency for desktop workloads. If they were ahead on energy efficiency, they would simply be ahead on power. Indeed, GPUs are massively parallel architectures, and they are generally limited by the transistor and power budget (and memory, of course).

Apple is simply behind in the GPU space.

> At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.

The reason why it's cheaper is that its memory is at a fraction (around 20-35%) of the memory bandwidth of a 128GB equivalent GPU set up, which also has to be split with the CPU. This is an unavoidable bottleneck of shared memory systems, and for a great many applications this is a terminal performance bottleneck.

That's the reason you don't have a GPU with 128GB of normal DDR5. It would just be quite limited. Perhaps for some cases it can be useful.

p1esk 4 years ago | | |

> its memory is at a fraction (around 20-35%) of the memory bandwidth of a 128GB equivalent GPU set up

Here's some info about M1 memory bandwidth: https://www.anandtech.com/show/17024/apple-m1-max-performanc...

mkaic 4 years ago | | |

Interesting, I wasn't aware of the memory bandwidth point, though it makes sense. TIL!

dekhn 4 years ago | |

I remain skeptical that Apple's best GPU silicon will match Nvidia's premier products (either the top-end desktop card, or a server monster) for training.

It seems like this is ideal as an accelerator for already trained models; one can imagine Photoshop utilizing it for deep-learning based infill-painting.

I was already doing training on battery with a laptop that had a 1080; I have trained models on an airplane while totally unplugged and still had enough power to websurf afterwards.

hedgehog 4 years ago | |

To me the cool thing is working through a PyTorch-based course like FastAI on a local Mac may now be above the tolerably fast threshold.

mhh__ 4 years ago | |

The thing with the efficiency (which I'm not sure of) and the competition (probably possible) is that the current Nvidia lineup is pretty old and on an even older process. They have a big moat.

smoldesu 4 years ago | |

There's definitely competition, and it's going to be really interesting to watch Nvidia and Apple duke it out over the next few years:

- Apple undoubtedly owns the densest nodes, and will fight TSMC tooth-and-nail over first dibs on whatever silicon they have coming next.

- Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.

- Nvidia has insane engineers. Despite the fact they're using silicon that's more than twice as large by-area when compared to Apple, they're still doubling their numbers across the board. And that's their last-gen tech too, the comparison once they're on 5nm later this summer is going to be insane.

I expect things to be very heated by the end of this year, with new Nvidia, Intel and potentially new Apple GPUs.

fulafel 4 years ago | |

> an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory

Interesting observation. I wonder what the biggest-memory iGPU configuration is that you can get on the x86 side?

ekelsen 4 years ago |

Nice results! But why are people still reporting benchmark results on VGG? Does anybody actually use this network anymore?

Better would be mobilenets or efficientNets or NFNets or vision transformers or almost anything that's come out in the 8 years since VGG was published (great work it was at the time!).

singularity2001 4 years ago |

The installation command generated on https://pytorch.org/get-started/locally/ didn't install the latest version for me. What did it was:

    pip3 install --pre torch==1.12.0.dev20220518 --extra-index-url https://download.pytorch.org/whl/nightly/cpu

singularity2001 4 years ago | |

If you came late make sure to update the date to 20220521 …

tzekid 4 years ago | |

Ahh just saw this after compiling pytorch from source. Thanks!
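After installing a nightly like the one above, it can be handy to confirm which build was actually picked up and whether it can see the MPS backend. A minimal sketch, assuming a PyTorch build recent enough to expose `torch.backends.mps` (the code falls back gracefully when it is missing); the function name `describe_torch_install` is made up for this example.

```python
import importlib.util

def describe_torch_install() -> str:
    """Summarise the installed PyTorch build and whether MPS is usable."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    mps = getattr(torch.backends, "mps", None)  # absent in older builds
    if mps is None:
        return f"torch {torch.__version__}: no MPS backend in this build"
    return (f"torch {torch.__version__}: "
            f"MPS built={mps.is_built()}, available={mps.is_available()}")

print(describe_torch_install())
```

If the version printed is not the nightly you asked for, pip most likely resolved an older wheel from a different index.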

nafizh 4 years ago |

Exciting!! But don't see comparison with any laptop Nvidia GPUs in terms of performance. That would be insightful.

sudosysgen 4 years ago | |

It compares unfavourably, but then again NVidia GPUs on laptop are massive powerhogs.

smlacy 4 years ago | | |

Do apple users really require the ability to train large ML models while mobile and without access to A/C power? Is this a real-world use case for the target market?

buildbot 4 years ago |

This is very interesting since the M1 studio supports 128GB of unified memory - training a large memory heavy model slowly on a single device could be interesting, or inferencing a very large model.

zdw 4 years ago | |

Everything old is new again - the M1 studio's unified memory echoes the SGI O2 which had similar unified CPU/GPU memory back in the 90's.

In both cases the unified memory machines outperformed much larger machines in specific use cases.

smoldesu 4 years ago | | |

...specific use cases being the key operand here. Unified memory is cool, but there are reasons we don't use it at-scale:

- It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more RAM on their server chips)

- ECC is still off-the-table on M1 apparently

- Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.

- Most of the so-called "specific workloads" that you're outlining aren't very general applications. So far I've only seen ARM outrun x86 in some low-precision physics demos, which is... fine, I guess? I still don't foresee meteorologists dropping their Intel rigs to buy a Mac Studio anytime soon.

ivstitia 4 years ago |

There was a report from a few months ago comparing the M1 Pro with several Nvidia GPUs: https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...

I'm curious how the benchmarks change with this new release!

almostdigital 4 years ago |

Anyone actually got this to run on an M1 Mac?

    $ conda install pytorch torchvision torchaudio -c pytorch-nightly
    Collecting package metadata (current_repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    Collecting package metadata (repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.

    PackagesNotFoundError: The following packages are not available from current channels:

      - torchaudio

And the pip install variant installs an old version of torchaudio that is broken

    OSError: dlopen(/opt/homebrew/Caskroom/miniforge/base/envs/test123/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at14RecordFunctionC1ENS_11RecordScopeEb

fragmede 4 years ago | |

    pip3 install torch

worked for me. I think it's something with your brew installation.

    fragmede@samairmac:~$ python
    Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:24:02)
    [Clang 11.1.0 ] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> torch.__file__
    '/Users/fragmede/projects/miniforge3/lib/python3.9/site-packages/torch/__init__.py'
    >>>

almostdigital 4 years ago | | |

Does torchaudio work for you? I can get torch and torchvision to work but not torchaudio

boopmaster 4 years ago | |

    pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Scene_Cast2 4 years ago |

I'm curious about the performance compared to something like, say, the RTX 3070.

my123 4 years ago | |

Low. Apple doesn't have matrix math accelerators in their current GPUs.

The neural engine is small and inference only. It's also only exposed by a far higher level interface, CoreML.

Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but with the price of a Mac, not sure that makes a lot of sense either in most scenarios compared to paying for a big dGPU.

Kon-Peki 4 years ago | | |

> Apple doesn't have matrix math accelerators in their current GPUs.

That's because the M1 has a dedicated matrix math accelerator called AMX [1]. I've used it with both Swift and pure C.

[1] https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...

LeanderK 4 years ago | | |

> The neural engine is small and inference only

Why is it inference only? At least the operations are the same...just a bunch of linear algebra

apohn 4 years ago | |

I wrote a comment comparing TensorFlow on M1 to some cloud providers. I imagine PyTorch on M1 would give similar results. I think the gist would be that the 3070 is going to be a better investment.

https://news.ycombinator.com/item?id=30608125

ivstitia 4 years ago | |

Here are some comparison numbers I've come across: https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...

It is not really comparable on a step per second level but the power consumption and now GPU memory will make it pretty enticing.

MasterScrat 4 years ago |

Small code example in the PyTorch doc:

https://pytorch.org/docs/master/notes/mps.html

nxpnsv 4 years ago | |

Tried https://pytorch.org/tutorials/beginner/basics/quickstart_tut... with mps vs cpu. mps worked, but cpu actually was faster (16 vs 21s). Perhaps I am doing it wrong...
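A small timing harness makes this kind of mps vs cpu comparison easier to repeat. The sketch below is an assumption-laden illustration, not the benchmark used in the comment above: the function names, the matrix size, and the repeat count are all made up. It calls `.item()` on the result so that the timer only stops once the device has finished the work. One plausible reason the CPU can win on small workloads is that per-operation dispatch overhead dominates; trying larger `n` is a cheap way to probe that.

```python
import importlib.util
import time

def available_devices() -> list:
    """Devices worth timing: "cpu", plus "mps" when PyTorch reports it usable."""
    if importlib.util.find_spec("torch") is None:
        return []  # PyTorch is not installed, nothing to time
    import torch
    devices = ["cpu"]
    mps = getattr(torch.backends, "mps", None)  # absent in older builds
    if mps is not None and mps.is_available():
        devices.append("mps")
    return devices

def time_matmul(device: str, n: int = 256, repeats: int = 10) -> float:
    """Average seconds per n x n matrix multiply on the given device."""
    import torch
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)
    (a @ b).sum().item()  # warm-up; .item() blocks until the result is ready
    start = time.perf_counter()
    for _ in range(repeats):
        (a @ b).sum().item()
    return (time.perf_counter() - start) / repeats

for dev in available_devices():
    print(dev, f"{time_matmul(dev):.6f} s per matmul")
```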

singularity2001 4 years ago |

Anyone else getting "illegal hardware instruction"?

    (pytorch_env) ~/dev/ai/ python -c "import torch"

zimpenfish 4 years ago | |

IIRC, when I had that problem, it was because it was loading the wrong arch for Python.
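One way to check for the wrong-architecture case is to ask the interpreter what it is running as. This is a minimal sketch using only the standard library; on Apple Silicon a native Python is expected to report `arm64`, while one running under Rosetta reports `x86_64`. The helper name `interpreter_arch` is made up for this example.

```python
import platform
import sys

def interpreter_arch() -> str:
    """Architecture the running Python process reports (e.g. arm64, x86_64)."""
    return platform.machine()

arch = interpreter_arch()
print(f"{sys.executable} is running as {arch}")

if sys.platform == "darwin" and arch != "arm64":
    # Hedged hint only: an x86_64 interpreter cannot load arm64-only wheels.
    print("This looks like an x86_64 Python (possibly under Rosetta).")
```

If the architecture is not what you expect, the usual suspect is a conda or Homebrew install that was set up before switching to a native arm64 toolchain.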

in3d 4 years ago |

It’s surprising to see PyTorch developers working on things like that when common operations like group convolutions are still completely unoptimized on Nvidia GPUs, despite many requests.

jacobn 4 years ago | |

Grouped convolutions can't really run faster than groups * conv(ch/group) and I believe that's close to where they're at?

Note that for ch<O(512) (varies by GPU & hw) you tend to be memory-transfer-speed limited, not compute limited.

So unfortunately depthwise convolutions end up having terrible performance.
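The memory-bound argument above can be made concrete with a rough count of multiply-accumulates versus bytes moved. This is a simplified sketch under stated assumptions: fp32 tensors, output the same spatial size as the input, and every input, weight, and output element touched exactly once. The layer shape and all function names are illustrative and not taken from the thread.

```python
def conv_macs(h, w, in_ch, out_ch, k, groups=1):
    """Multiply-accumulates for a k x k convolution with same-size output."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return h * w * out_ch * (in_ch // groups) * k * k

def conv_bytes(h, w, in_ch, out_ch, k, groups=1, bytes_per_elem=4):
    """Minimum bytes moved: read input and weights once, write output once."""
    inputs = h * w * in_ch
    outputs = h * w * out_ch
    weights = out_ch * (in_ch // groups) * k * k
    return (inputs + outputs + weights) * bytes_per_elem

def arithmetic_intensity(**layer):
    """MACs per byte moved; low values suggest a memory-bound layer."""
    return conv_macs(**layer) / conv_bytes(**layer)

# Illustrative layer shape (an assumption, not a measured workload).
shape = dict(h=56, w=56, in_ch=256, out_ch=256, k=3)
dense = arithmetic_intensity(**shape, groups=1)
depthwise = arithmetic_intensity(**shape, groups=256)
print(f"dense: {dense:.1f} MACs/byte, depthwise: {depthwise:.1f} MACs/byte")
```

Grouping divides the arithmetic by the number of groups while the activation traffic stays the same, so the depthwise case does far less work per byte than the dense one.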

in3d 4 years ago | | |

Why wouldn’t you be able to run them in parallel using CUDA? You shouldn’t be memory-transfer speed limited when group convolution layers are a part of a bigger net.

Note that depthwise convolutions are a special case of group convolutions and actually I think they might be specially optimized in PyTorch (I’d have to run some benchmarks to test it though).

arecurrence 4 years ago |

This is much nicer ergonomics than what I had to do for tensorflow. It’s ostensibly out of the box support as a different torch device.

mark_l_watson 4 years ago | |

I agree. I appreciated the M1/Metal TensorFlow support, but that was not as easy to setup.

alfalfasprout 4 years ago | | |

I mean, building tensorflow is generally an awful experience.

dangrie158 4 years ago | |

You must have installed it a while ago then :). I just recently did and only needed to install 2 packages via pip (I think tensorflow-macos and tensorflow-metal), which I found much better than wrangling with CUDA and cuDNN versions for Nvidia cards.

dilielloneluca 4 years ago |

I started collecting benchmarks of the M1 Max on PyTorch here: https://github.com/lucadiliello/pytorch-apple-silicon-benchm...

munro 4 years ago |

yess! This is important for me, because I don't have any $$$ to rent GPUs for personal projects. Now we just need M1 support for JAX.

Since there are no hard benchmarks against other GPUs, here's a Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks like it's about 2x slower--the RTX laptop absolutely rips for gaming, I love it.

[1] https://browser.geekbench.com/v5/compute/compare/4140651?bas...

jph00 4 years ago | |

You can use GPUs for free on Paperspace Gradient, Google Colab, and Kaggle.

kristianp 4 years ago |

A tangential thought: will we see animation studios buy Mac Studios for their rendering farms? What do they use these days, AWS EC2?

Kalanos 4 years ago |

Anyone care to comment on how this is better than Metal's TensorFlow support?

macshome 4 years ago |

Does this work on any Metal hardware or just the M1 GPU?

atty 4 years ago | |

This targets AMD GPUs and M1 GPUs; it currently does not target the integrated Intel GPUs present in Intel machines. However, if you have a 16 inch Intel MBP, or a Mac Pro, etc, this should work with your AMD GPUs. That support isn’t in the nightly packages yet (only Apple Silicon support so far) but the PyTorch team is saying that it will hopefully be available by the end of the week. If you just can’t wait, you should be able to build from source to test it out right now.

cj8989 4 years ago |

really hope to see some comparisons with nvidia gpus!

toppy 4 years ago |

Does speed up refer to absolute value or percentage?

dagmx 4 years ago | |

At least for the charts, it looks like a multiplier (or divisor I guess) since the CPU baseline looks to be at 1
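Read as a multiplier, a speedup is just the baseline time divided by the accelerated time, so the CPU baseline sits at 1 by construction. A tiny sketch (the function name is made up), using the 16 s cpu and 21 s mps quickstart timings another commenter reported in this thread as sample numbers:

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Speedup as a multiplier: above 1 is faster than baseline, below 1 is slower."""
    return baseline_seconds / accelerated_seconds

# Sample timings quoted in this thread: cpu took 16 s, mps took 21 s.
print(f"{speedup(16, 21):.2f}x")
```

A value under 1, as in that sample, means the accelerated path was actually slower than the CPU baseline.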

toppy 4 years ago | | |

You're right! I've missed this.

sbeckeriv 4 years ago |

What is the * in the chart referencing?

mrchucklepants 4 years ago | |

Probably supposed to be referencing the text under the plot stating the specific configuration of the hardware and software.

sbeckeriv 4 years ago | | |

looks like the website was updated after I posted. I used page search to look for the *.

amelius 4 years ago |

> Accelerated GPU training is enabled using Apple’s Metal Performance Shaders (MPS) as a backend for PyTorch.

What do shaders have to do with it? Deep learning is a mature field now, it shouldn't need to borrow compute architecture from the gaming/entertainment field. Anyone else find this disconcerting?

dagmx 4 years ago | |

Shaders are just the way compute is defined on the GPU.

Why is that concerning to you?

WhitneyLand 4 years ago | | |

It’s not the greatest term even for graphics only.

People new to CG are likely to intuit “shaders” as something related to, well, shading, but vertex shaders et al have nothing to do with the color of a pixel or a polygon.

my123 4 years ago | | |

That terminology isn't used at all in GPGPU compute APIs specifically tailored for that purpose, which use quite different programming models where you can mix host and device code in the same program.

And there are "GPUs" today that can't do graphics at all (AMD MI100/MI200 generations) or only in a restricted way (Hopper GH100, which has the fixed-function pipeline on only two TPCs, for compatibility, and runs graphics very slowly because of that).

my123 4 years ago | |

Apple doesn't have a separate API tailored towards compute only, but a single unified API that makes concessions to both.

Concessions towards compute: a C++ programming language for device code (totally unlike what's done for most graphics APIs!)

Concessions towards graphics: no single-source programming model at all for example...

sudosysgen 4 years ago | | |

Many GPUs allow you to write device code in C++ via SYCL. It works well enough.

geertj 4 years ago | |

Not sure if it’s concerning but it caught my eye as well.