I've been working on a guide for ML folks on upgrading single-GPU training code to multi-GPU and multi-node. It includes code diffs and explanations. The guide builds up to this final chapter (linked) on how to train a very large model like Llama 3.1 405B on a big cluster with plain PyTorch. Everything is written directly against the PyTorch APIs (other than the model code, which uses `transformers` models). If there are topics of interest, feel free to open an issue in the repo; contributions are welcome. I'm looking into adding a chapter on tensor parallelism, but its support in PyTorch is still in the early stages.