Show HN: How-to guide on training Llama-405B using PyTorch distributed APIs (github.com)

3 points by lambda-research 1 year ago | 4 comments

I've been working on a guide for ML folks to upgrade their single-GPU training code to multi-GPU and multi-node. Code diffs and explanations are included.

The guide builds up to this final chapter (linked) on how to train a very large model like Llama 3.1 405B on a big cluster with plain PyTorch.

Everything is written directly against the PyTorch APIs (other than the model code, which just uses `transformers` models).

If there are topics of interest, feel free to open an issue in the repo. Contributions are welcome.

I'm investigating adding a chapter on tensor parallelism, but its support in PyTorch is still in the early stages.

lostmsu 1 year ago

The guide does not say how efficient this run was in terms of GPU utilization (achieved TOPS / theoretical peak TOPS).
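For reference, one common way to put a number on this is model FLOPs utilization (MFU). Below is a minimal sketch, not something taken from the guide. It assumes the usual approximation of about 6 FLOPs per parameter per training token (forward plus backward), and the peak figure is an assumed dense BF16 number for an H100 SXM that you should check against your hardware's spec sheet. The function name and the throughput value are hypothetical placeholders.

```python
def model_flops_utilization(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Estimate MFU = achieved model FLOP/s / theoretical peak FLOP/s.

    Assumption: training costs roughly 6 FLOPs per parameter per token
    (about 2 for the forward pass and 4 for the backward pass). This
    ignores attention FLOPs and any activation recomputation.
    """
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec


# Assumed peak: ~989 TFLOP/s dense BF16 per H100 SXM (check your datasheet).
H100_BF16_PEAK = 989e12

# Hypothetical throughput, for illustration only; measure your own run.
example_tokens_per_sec = 1_000.0

mfu = model_flops_utilization(
    n_params=405e9,
    tokens_per_sec=example_tokens_per_sec,
    n_gpus=64,
    peak_flops_per_gpu=H100_BF16_PEAK,
)
print(f"MFU: {mfu:.2%}")
```

To use it, plug in the cluster-wide tokens per second measured from the training loop. Note that the "GPU util" reported by `nvidia-smi` is a different metric and is usually much higher than MFU.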

lambda-research 1 year ago

Hey, there are some details about this scattered throughout the guide. The answer really depends on the technique. For DDP you can fairly easily get the same per-GPU throughput as single-GPU training (we were getting ~80% GPU util for multiple nodes, IIRC), as long as all the workers are getting the same sized data.

Once you move to training really large models like Llama 405B with FSDP and use things like CPU offloading, the throughput goes down quite a bit due to all the data transfers between CPU and GPU. If you have a large enough cluster and don't have to use CPU offloading, you can get higher throughput.
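To illustrate the "same sized data" point: one reading is that every rank should get the same number of samples, so no worker sits idle at a synchronization point waiting for the others. The sketch below is plain Python with a hypothetical helper name, not code from the guide. It pads the index list by wrapping around and then strides it across ranks, which is similar in spirit to what `torch.utils.data.distributed.DistributedSampler` does when `drop_last=False`.

```python
import math


def shard_indices(n_samples, world_size, rank):
    """Return the sample indices for one rank, equal in count on every rank.

    Assumes n_samples >= world_size. The index list is padded by repeating
    indices from the start until it divides evenly, then each rank takes
    every world_size-th index starting at its own rank.
    """
    per_rank = math.ceil(n_samples / world_size)
    total = per_rank * world_size
    indices = list(range(n_samples))
    indices += indices[: total - n_samples]  # wrap-around padding
    return indices[rank:total:world_size]


# 10 samples over 4 workers: every rank gets exactly 3 indices.
shards = [shard_indices(10, 4, r) for r in range(4)]
for r, shard in enumerate(shards):
    print(r, shard)
```

If "same sized" is instead about sequence lengths within a batch, the same concern applies: ranks with longer batches make the others wait during gradient all-reduce.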
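A rough back-of-the-envelope check shows why offloading can become necessary on the 64 H100 setup. This is a sketch under stated assumptions, not the guide's actual accounting: it assumes 80 GB of memory per H100 and a mixed-precision Adam setup costing about 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 4 for FP32 master weights, and 4 + 4 for the two Adam moments). Activations, buffers, and fragmentation are ignored, so real usage is higher.

```python
def training_state_bytes(n_params, bytes_per_param=16):
    """Bytes for weights, gradients, and optimizer state.

    Assumption: 16 bytes per parameter for mixed-precision Adam
    (2 weights + 2 grads + 4 master weights + 8 Adam moments).
    Activation memory is not included.
    """
    return n_params * bytes_per_param


N_PARAMS = 405_000_000_000       # Llama 3.1 405B
N_GPUS = 64                      # 8 nodes x 8 H100s, as in the thread
HBM_BYTES_PER_GPU = 80 * 10**9   # assumed 80 GB per H100

state_bytes = training_state_bytes(N_PARAMS)
cluster_hbm_bytes = N_GPUS * HBM_BYTES_PER_GPU
fits_on_gpus = state_bytes <= cluster_hbm_bytes

print(f"training state: {state_bytes / 1e12:.2f} TB")
print(f"cluster HBM:    {cluster_hbm_bytes / 1e12:.2f} TB")
print(f"fits without offloading: {fits_on_gpus}")
```

Under these assumptions the sharded state alone exceeds the cluster's total GPU memory, which is consistent with needing CPU offloading or a larger cluster. With a leaner setup (for example, pure BF16 optimizer state) the numbers change, so treat this as an estimate only.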

lostmsu 1 year ago

You are talking about a specific setup:

> Here we are going to utilize an 8 node cluster (64 H100 GPUs)

lambda-research 1 year ago

Let me know if there are any questions or suggestions!

Feel free to open an issue on GitHub. Contributions are welcome too.