CodeParrot: Train and evaluate your own CoPilot model (huggingface.co)
What is it? It is a code base to build large language models for code generation from scratch. It includes the code to clean up GitHub-scale datasets, train GPT-2-style models from scratch on distributed infrastructure, and evaluate them on OpenAI's HumanEval benchmark. We are also releasing two trained models and some demos to play with the models: https://huggingface.co/spaces/lvwerra/codeparrot-generation
This is my first project at Hugging Face, and I thought it would be a good opportunity to talk about some of the behind-the-scenes difficulties of such a project, which usually don't get talked about much after a release. The topic of the story: distributed debugging is hard!
The first few small experiments training GPT-2 models on a code dataset went well, so we decided to scale and train the first smallish model for longer. For mysterious reasons, the training just stopped after a few hours. Not with an error, but the training loop just didn't continue. Even more interesting, when repeating the experiment it happened again, but each time at a different step. And we never observed it in experiments with fewer workers. What the hell was going on?
After literally weeks of debugging and experimenting, finally some insight. I started to log everything with the maximum verbosity possible. Interestingly, the training stop always coincided with a debug message concerning some retry. It did not always stop when such a message appeared, but whenever it stopped such a message was there.
We used a feature called streaming in the Hugging Face datasets library to read the data on the fly. It has a retry mechanism for when reading the next chunk from a file fails. It turns out that when many workers are reading from the same file and one worker fails and tries to read again, there is a tiny chance of a deadlock. The more workers there are, the higher the chance of a deadlock, which also explains why we never observed this in smaller experiments.
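To get a feel for why more workers make a rare hang much more likely, here is a back-of-the-envelope sketch. It assumes every retry carries the same small, independent chance of deadlocking, which is a simplification (a real deadlock needs contention between workers), and the numbers `p` and `retries_per_worker` are made up purely for illustration, not measured from our runs.

```python
def p_any_deadlock(p_single: float, n_events: int) -> float:
    """Probability that at least one of n_events independent retries deadlocks."""
    return 1.0 - (1.0 - p_single) ** n_events


# Hypothetical values, chosen only to show the trend.
p = 1e-4                 # assumed chance that a single retry deadlocks
retries_per_worker = 50  # assumed number of retries per worker over a run

for workers in (2, 8, 64):
    total_retries = workers * retries_per_worker
    print(f"{workers:>3} workers: {p_any_deadlock(p, total_retries):.4f}")
```

Under these assumptions the chance of hitting at least one deadlock grows roughly linearly with the number of workers while it is still small, which matches the experience of small runs finishing cleanly and larger ones hanging.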
The retries could easily be avoided by changing some streaming settings, and later we switched to having a single data processing worker for iterable datasets, which resolved the issue altogether and also improved training efficiency by up to 25%!
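For readers who have not seen one, a retry-on-failed-read mechanism of the kind described above looks roughly like the sketch below. This is not the datasets library's actual code: `read_with_retries`, `max_retries`, `retry_interval`, and `flaky_read` are hypothetical names used only to illustrate the idea and the sort of settings that control it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_retries(
    read_chunk: Callable[[], T],
    max_retries: int = 3,
    retry_interval: float = 0.0,
) -> T:
    """Call read_chunk, retrying on OSError up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return read_chunk()
        except OSError:
            if attempt == max_retries:
                raise  # out of retries: surface the error instead of looping
            time.sleep(retry_interval)
    raise AssertionError("unreachable")


# Simulated reader that fails twice before succeeding (hypothetical).
calls = {"n": 0}


def flaky_read() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated read failure")
    return b"chunk"


print(read_with_retries(flaky_read, max_retries=5))
```

Setting `max_retries=0` in this sketch turns the first failure into an immediate error, which is much easier to debug than a training loop that silently stops.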
My main takeaway from this: I totally underappreciated how hard and stressful debugging sessions are at scale. Every experiment is expensive and might take a long time to fail, which slows down the iteration cycle considerably. Distributed systems also behave very differently from simple single-process programs, and it takes some time to adapt the mental model.
If you are curious about what it takes to train such a model, check out the blog post for a brief tutorial on training CodeParrot: https://huggingface.co/blog/codeparrot
Also, all code is open source, free to use, and available here: https://github.com/huggingface/transformers/tree/master/exam...
If you are interested in more details about the design considerations when setting up a large dataset, building efficient tokenizers, and architecture choices, make sure you have a look at the CodeParrot chapter in the upcoming book on Transformers and NLP: https://learning.oreilly.com/library/view/natural-language-p...