A GPT in 60 Lines of NumPy

A GPT in 60 Lines of NumPy (jaykmody.com)

1563 points by squidhunter 3 years ago | 146 comments

jaykmody 3 years ago |

Hey y'all, author here!

Thank you for all the nice and constructive comments!

For clarity, this is ONLY the forward pass of the model. There's no training code, batching, kv cache for efficiency, GPU support, etc ...

The goal here was to provide a simple yet complete technical introduction to the GPT as an educational tool. Tried to make the first two sections something any programmer can understand, but yeah, beyond that you're gonna need to know some deep learning.

Btw, I tried to make the implementation as hackable as possible. For example, if you change the import from `import numpy as np` to `import jax.numpy as np`, the code becomes end-to-end differentiable:

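    import jax
    import jax.numpy as np  # in place of `import numpy as np`

    def lm_loss(params, inputs, n_head) -> float:
        # inputs: 1-D integer array of token ids
        x, y = inputs[:-1], inputs[1:]
        output = gpt(x, **params, n_head=n_head)  # assumed [n_seq, n_vocab] probabilities
        # pair each position with its target token id; output[y] alone would pick whole rows
        loss = np.mean(-np.log(output[np.arange(len(y)), y]))
        return loss

    grads = jax.grad(lm_loss)(params, inputs, n_head)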
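To make the indexing in a loss like `lm_loss` concrete, here is a minimal NumPy sketch. It assumes the model output is an `[n_seq, n_vocab]` matrix of probabilities; the numbers and the names `probs` and `targets` are made up for illustration. Picking each target token's probability needs paired row and column indices, because indexing with the targets alone selects whole rows.

```python
import numpy as np

# Hypothetical model output: one probability distribution per position.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
targets = np.array([0, 1, 2])  # next-token id for each position

# Paired indexing: row i, column targets[i].
picked = probs[np.arange(len(targets)), targets]

# Mean negative log-likelihood of the target tokens.
loss = np.mean(-np.log(picked))

# For contrast: indexing with targets alone returns whole rows.
rows = probs[targets]

print(picked.shape, rows.shape, loss)
```

The same indexing pattern carries over to `jax.numpy` when the targets are an integer array.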

You can even support batching with `jax.vmap` (https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.h...):

    gpt2_batched = jax.vmap(gpt2, in_axes=0)
    gpt2_batched(batched_inputs) # [batch, seq_len] -> [batch, seq_len, vocab]

Of course, with JAX comes in-built GPU and even TPU support!

As far as training code and KV Cache for inference efficiency, I leave that as an exercise for the reader lol
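For readers curious about the KV cache exercise, here is a rough single-head sketch in NumPy. It is not taken from picoGPT: the function names `causal_attention` and `attend_with_cache` and the dict-based cache are assumptions made for illustration. The idea is that when generating one token at a time, the keys and values of earlier tokens can be stored and reused instead of recomputed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def causal_attention(q, k, v):
    # q, k, v: [n_seq, d]. Each position attends to itself and earlier positions.
    n_seq, d = q.shape
    mask = (1 - np.tri(n_seq)) * -1e10
    return softmax(q @ k.T / np.sqrt(d) + mask) @ v

def attend_with_cache(q_new, k_new, v_new, cache):
    # q_new, k_new, v_new: [d] vectors for the newest token only.
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    k = np.stack(cache["k"])  # [n_cached, d]
    v = np.stack(cache["v"])  # [n_cached, d]
    scores = k @ q_new / np.sqrt(q_new.shape[-1])  # [n_cached]
    return softmax(scores) @ v  # [d]

rng = np.random.default_rng(0)
n_seq, d = 5, 8
q, k, v = rng.normal(size=(3, n_seq, d))

# Full causal attention over the whole sequence at once.
full = causal_attention(q, k, v)

# Same result, computed one token at a time with cached keys and values.
cache = {"k": [], "v": []}
incremental = np.stack(
    [attend_with_cache(q[i], k[i], v[i], cache) for i in range(n_seq)]
)

print(np.allclose(full, incremental))
```

A real implementation would also cache per layer and per head, and would project only the newest token into q, k, and v at each step.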

tysam_and 3 years ago | |

"hackable" and "simple yet complete technical introduction"

Music to my ears, well done and don't worry too much about the negative comments! They'll come out for anything you do I think.

I saw a tweet from someone the other day talking about how they massively increased their training speed by changing part of their architecture to have dimensions that were a multiple of 64 rather than a prime-like kind of number.

One of the comments below it? ~"Seems very architecture specific."

lol.

So don't sweat it! <3 Great work and thanks for putting yourself out there, super job! :D :D :D :D :)))))) <3 :D :D :fireworks:
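On the multiple-of-64 point: the tweet isn't linked here, so which dimension was changed is a guess. One common case is padding the vocabulary size, sketched below with a hypothetical helper `round_up_to_multiple` (50257 is GPT-2's vocabulary size).

```python
def round_up_to_multiple(n: int, multiple: int = 64) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

gpt2_vocab = 50257
padded_vocab = round_up_to_multiple(gpt2_vocab)
print(gpt2_vocab, "->", padded_vocab)
```

The extra rows in the embedding table are simply never used by the tokenizer.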

pavelstoev 3 years ago | | |

We do GPU-specific training and inference speedups, at CentML.

moconnor 3 years ago | |

This is beautiful. Having worked with everything from nanoGPT to Megatron, sitting down and reading through picoGPT.py was clear and refreshing with just the essential details. Nothing left to add, nothing left to take away: perfection.

jimbokun 3 years ago | |

This looks like something Peter Norvig would write, and that’s about the highest compliment I can give.

eslaught 3 years ago | |

> GPU support

If you haven't tried cuNumeric [1], you really ought to. It's a drop-in NumPy wrapper for distributed GPU acceleration. Would be interesting to see if it works for this.

[1]: https://github.com/nv-legate/cunumeric

VHRanger 3 years ago | | |

The problem with drop-in replacements between CPU and GPU code is that performance GPU code requires rethinking the dataflow often -- so even if the code itself is a drop-in, the "make it good" part still requires some rewriting.

I'd be curious how that library compares to other numeric python GPU libraries

smcin 3 years ago | |

> For clarity, this is ONLY the forward pass of the model. There's no training code, batching, kv cache for efficiency, GPU support, etc ...

Neat, but please add one-line comments/docstrings where these missing bits would go.

ddalex 3 years ago | |

Hi there, thank you for putting this together !

I want to commend you for one of the best written introductions in this space that I've seen, especially the excellent use of hyperlinking that points to really good resources exactly at the right time !

ngcc_hk 3 years ago | |

I hope this goes the way of the open-source Go AIs. AlphaGo came and went; we needed an open one, and open source gave us one. I hope the same happens here.

simonw 3 years ago |

This article is an absolutely fantastic introduction to GPT models - I think the clearest I've seen anywhere, at least for the first section that talks about generating text and sampling.

Then it got to the training section, which starts "We train a GPT like any other neural network, using gradient descent with respect to some loss function".

It's still good from that point on, but it's not as valuable as a beginner's introduction.

barbazoo 3 years ago |

So much criticism in the comments. I appreciated the write-up and the code samples. For some people not in ML like myself it's hard to understand the concept behind GPT and this made it a little bit clearer.

tysam_and 3 years ago | |

I think this is a factor of putting one's self out there. I've had this happen on ML projects I've put out too, though being hyper-engaged in trying to thoughtfully respond to all (or as many as possible) of the comments has seemed to lower negativity a bit for me, just because it brings the 'person-in-the-room' effect to an online audience...at least, so I think! :D

I thought it was a great post and many kudos to the author for putting themselves out like that! I really appreciated this, and I think any work that puts this kind of effort into onboarding people and giving them tools to understand something well has some of the most long-term impact on the field.

Lowering barriers to entry, making resources accessible to all, and decreasing experimentation cycle time I think are some of the most critical components to making any progress at all in the field beyond a basic pittance. Imagine if everyone had easy access to, knowledge about, and rapid experimentation results in things like quantum mechanics, large-algorithm testing, painting arts, musical arts, etc. It would drive things so much further forward at an individual and field-based level so quickly. <3 :)))) :D :D ;D :D :D :))))))))

victor9000 3 years ago | | |

People respond with negativity for a variety of reasons, and it often has more to do with the commenter than the content.

lspears 3 years ago |

For those interested I would also check out Andrej Karpathy's YouTube video on building GPT from scratch:

https://youtu.be/kCc8FmEb1nY

azath92 3 years ago | |

Karpathy has a bunch of great resources on this front! His minGPT writeup is excellent: https://github.com/karpathy/minGPT. His more recent project, nanoGPT, which references this video, is a much more capable but still learning-friendly implementation.

ultrasounder 3 years ago |

I also learnt a ton from NLPDemystified: https://www.nlpdemystified.org. In fact, I used this resource first before attempting Andrej Karpathy's https://karpathy.ai/zero-to-hero.html. I find Nitin's voice soothing and am able to focus more. I also found the pacing good; the course introduces a lot of concepts at a beginner level and also points to appropriate resources along the way (spacy, for instance). Overall an exciting time to be a total beginner looking to grok NLP concepts.

adamnemecek 3 years ago |

It turns out that transformers have a learning mechanism similar to autodiff but better since it happens mostly within the single layers as opposed to over the whole graph. I wrote a paper on this recently https://arxiv.org/abs/2302.01834v1. The math is crazy.

tpoacher 3 years ago | |

"Combinatorial Hopf" would make an excellent beer name!

"Bartender! A half-pint of your finest Combinatorial Hopf, if you please!"

LukeB42 3 years ago | |

Can you explain like I'm 5 why this matters distinctly from how transformers are normally trained with autodiff and what its possible applications are?

adamnemecek 3 years ago | | |

I’m talking about attention only transformers. Those don’t have an autodiff but still learn. The math is actually really cool.

macrolocal 3 years ago | |

First question: why should the attention mechanism output and residual stream match?

adamnemecek 3 years ago | | |

Match is a bad word; they don’t match, they are duals. The residual stream aka identity mapping needs to be the identity of the attention mechanism as the attention mechanism learns.

But this is the same for all residual streams, not just those in transformers.

Join my discord to discuss this further https://discord.gg/mr9TAhpyBW

naasking 3 years ago | |

Can this all be done on the GPU so the CPU doesn't need to be involved to adjust the weights?

eddsh1994 3 years ago |

Why do people in ML put imports inside function definitions?

teaearlgraycold 3 years ago |

Reminds me of the scene from Westworld where they explain their failed prototypes of the human mind with millions of lines of code. The version that finally worked was only a few dozen.

qwerty456127 3 years ago |

How powerful/heavy is it? Some time ago there was a post here about implementing a GPT on a very constrained computer (under a gigabyte of RAM, some old CPU, no GPU (?)) as opposed to an ordinary kind of GPT requiring terabytes of RAM.

I immediately thought it would be nice to do something in the middle: taking full advantage of a reasonably modern multicore CPU with AVX support, a humble yet again reasonably modern OpenCL-capable GPU and some 32 Gigabytes of RAM.

lvwarren 3 years ago |

Make this change in utils.py:

    def load_gpt2_params_from_tf_ckpt(tf_ckpt_path, hparams):
        [...]
        # name = name.removeprefix("model/")
        name = name[len('model/'):]

and your cool example will run in Google Colab under Python 3.8; otherwise the 3.9 Jupyter patching is a headache.
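For context, `str.removeprefix` was added in Python 3.9, which is why the original line fails on 3.8. The slice above drops the first `len('model/')` characters unconditionally, which is fine when every name starts with `model/`. A slightly more defensive sketch, using a hypothetical helper `remove_prefix` that is not part of picoGPT and hypothetical example names:

```python
def remove_prefix(text: str, prefix: str) -> str:
    """Mimic str.removeprefix on Python < 3.9: strip prefix only if present."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

print(remove_prefix("model/wte", "model/"))    # name with the prefix
print(remove_prefix("other/name", "model/"))   # name without it is left alone
```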

est 3 years ago |

> GPT-3 was trained on 300 billion tokens of text from the internet and books:

> GPT-3 is 175 billion parameters

Total newbie here. What do these two numbers mean?

If we run a huge amount of text through BPE, do we get an array of length 300B?

What's the number if we de-dup these tokens? (size of vocab?)

Does 175B parameters mean there are 175B somewhat useful floats in the pre-trained neural network?

code_runner 3 years ago | |

I’ll do my best.

Number of params is the number of weights. Basically the number of learnable variables.

Number of tokens is how many tokens it saw during training.

Vocab size is the number of distinct tokens.

People have studied the relationship between params, tokens, and compute, and how it affects model performance, a good deal: https://arxiv.org/pdf/2203.15556.pdf
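A toy illustration of the three numbers, with made-up data and a made-up two-matrix "model" (the names `tokens`, `params`, `wte`, `w_out`, and `b_out` are assumptions for this sketch). In practice the vocabulary size is fixed by the tokenizer (50257 for GPT-2) rather than found by de-duplicating the training data.

```python
import numpy as np

# Toy "training set": token ids after tokenization (hypothetical values).
tokens = [5, 2, 7, 2, 5, 5, 9, 2]
n_tokens = len(tokens)            # how many tokens the model would see
n_distinct = len(set(tokens))     # distinct token ids present in this data

# Toy model: an embedding table plus an output projection with bias.
n_vocab, n_embd = 10, 4
params = {
    "wte": np.zeros((n_vocab, n_embd)),
    "w_out": np.zeros((n_embd, n_vocab)),
    "b_out": np.zeros(n_vocab),
}
# Parameter count = total number of learnable floats across all arrays.
n_params = sum(p.size for p in params.values())

print(n_tokens, n_distinct, n_params)
```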

eslaught 3 years ago |

I know this probably isn't intended for performance, but it would be fun to run this in cuNumeric [1] and see how it scales.

[1]: https://github.com/nv-legate/cunumeric

voz_ 3 years ago |

Wonderfully written, I love the amount of detail put into the diagrams. Would love breakdowns like this for more stuff :)

durdn 3 years ago |

Very impressive. Recently I watched this really amazing lecture on building GPT from scratch from Karpathy, I was blown away: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=642s

est 3 years ago |

If I maintain an open source project, could I build a doc page using a small GPT allowing users to query FAQ and common methods using natural language?

hummus_bae 3 years ago | |

You can build anything using numpy. You can build a supercomputer out of duct tape if you want to. Spinning up a db, serving an API, doing natural language processing, ... whatever you want. That said, there are niche solutions that do these things well and that can save you a lot of work. There are also frameworks (such as Pyntango/Jupyter/Nbspectrum) that let you spin up a simple API quickly.

thomasfromcdnjs 3 years ago |

This reads really well, thank you very much.

lvwarren 3 years ago |

make this change and it will run under Python 3.8 in google colab

        #name = name.removeprefix("model/")
        name = name[len('model/'):]

in function: load_gpt2_params_from_tf_ckpt in the utils.py module

sva_ 3 years ago |

Impressive, but only forward pass.

anigbrowl 3 years ago | |

I think the completeness and self-contained-ness more than offsets the limited scope. One of the problems in the ML field is rapidly multiplying logistical complexity, and I appreciate an example that is (somewhat) functional but simple enough to fit on a postcard and using very basic components.

thwayunion 3 years ago | |

It's an excellent learning tool :) Doing the backward pass in the same style would be a great tool for teaching.

time_to_smile 3 years ago | |

just replace the numpy code with jax.numpy and you should have a fully differentiable model ready for training!

pumanoir 3 years ago | | |

For someone not familiar with jax: if I do the suggested replacement, what'd be the little extra code to make it do the backward pass? Or is it all automatic and we literally would not need extra lines of code?

insane_dreamer 3 years ago |

nice and clear. a worthy contribution to the subject.

terran57 3 years ago |

From the article:

"Of course, you need a sufficiently large model to be able to learn from all this data, which is why GPT-3 is 175 billion parameters and probably cost between $1m-10m in compute cost to train.[2]"

So, perhaps a better title would be "GPT in 60 Lines of Numpy (and $1m-$10m)"

eric_hui 3 years ago |

fantastic article about GPT. Thank you for sharing

freecodyx 3 years ago |

Since most models require little code compared to big software projects, why not use C++ or any other compiled language directly? Python, with its magic functions and shortcuts, is just hiding too much complexity, which can result in bugs and performance issues. Plus, the code is harder to maintain.

CaptainNegative 3 years ago | |

> Python with its magic functions, shortcuts is just hiding too much complexity

One counterpoint would be that verbosity, especially in the heavy syntax style of languages such as C++, distracts the reader and helps bugs hide in plain sight. For a silly example, imagine trying to read and verify the correctness of an academic paper from its uncompiled LaTeX source.

mhh__ 3 years ago | |

A lot of AI people (not a huge amount, but more than you'd think) can't code in any sense that would get them a job at a normal software company. Python is easy and fast enough to last until the model is obsolete.

stavros 3 years ago | |

Which magic functions and shortcuts in the posted code do you feel might introduce bugs?

freecodyx 3 years ago | | |

In general, the article is fine.