Megatron-Turing NLG 530B, the World’s Largest Generative Language Model

(developer.nvidia.com)

116 points by selimonder 4 years ago | 99 comments

cs702 4 years ago |

So we now have models with 0.5 trillion parameters, each the weight of a connection in a neural network.

Trillion-parameter models are surely within reach in the near term -- and that's only within two orders of magnitude of the number of synapses in the human brain, which is in the hundreds of trillions, give or take. To paraphrase the popular saying, a trillion here, a trillion there, and pretty soon you're talking really big numbers.

I know the figures are not comparable apples-to-apples, but still, I find myself in awe looking at how far we've come in just the last few years, to the point that we're realistically contemplating the possibility of seeing dense neural networks with hundreds of trillions of parameters used for real-world applications in our lifetime.

We sure live in interesting times.

YeGoblynQueenne 4 years ago | |

I don't understand this kind of comment. To my mind what it amounts to is "look at how big it is". Alright. So it's big. So what? Is this an elephant pageant?

Suppose a friend comes over and says "I went for dinner at a restaurant. Oh my god the portions were sooo big!". Wouldn't you want to know more information about the food and the restaurant, before you decided whether you're interested in it?

I appreciate that "big" is in people's minds associated with "strong", but most of the work in making language models bigger and bigger goes against the normal trend in computer science [1] and also neural networks research in general, where the trend is to constantly try to reduce the size of models and improve their data efficiency.

What's worse, the trend to supersize language models is never justified, either theoretically (ha ha) or empirically in the relevant literature - and when rival teams make the obvious experiments the evidence is that size is not required to achieve good performance. For example:

It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

https://aclanthology.org/2021.naacl-main.185/

____________

[1] Imagine someone bragging that their mergesort implementation has a million LOC! People brag about implementations in few lines of code, not many.

sanxiyn 4 years ago | | |

> What's worse, the trend to supersize language models is never justified, either theoretically (ha ha) or empirically in the relevant literature

What? That's absurd. Large language models are motivated by empirical scaling law. It is actually better justified than other ML research.

Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361

TeMPOraL 4 years ago | | |

> I appreciate that "big" is in peoples' minds associated with "strong"

In my mind, this is now called Pakled reasoning, from the scene in Star Trek: Lower Decks.

  Pakled rebel turned leader:
  "I am now Pakled leader. Behold my giant helmet!"

  Other Pakled:
  "He is strong!"

https://www.youtube.com/watch?v=lv1uhAa_M_U&t=193s

cs702 4 years ago | | |

> I don't understand this kind of comment. To my mind what it amounts to is "look at how big it is". Alright. So it's big. So what?

It's a good question. A while ago, Rich Sutton wrote a good answer to it: http://incompleteideas.net/IncIdeas/BitterLesson.html -- I recommend reading the whole essay. Quoting him (emphasis mine):

> The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.

> We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.

> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

A key related question -- to which no one has the answer today -- is whether we must scale computation to match or exceed that of the human brain to be able to replicate or surpass its cognitive abilities. (Note that this question is independent of whether doing so would require future theoretical breakthroughs -- another question to which no one knows the answer today.)

PS. See also sanxiyn's response: https://news.ycombinator.com/item?id=28838745

LuisMondragon 4 years ago | | |

New AI fallacy: appeal to size

singularity2001 4 years ago | |

Except that every synapse is not a dumb weight but a highly complex system connected to an even more complex system (aka neuron) which might each be a (super)computer on its own.

Given how extremely bad we are at computing, there is hope (for ai) that the neurons or their circuits are not _that_ powerful after all.

abecedarius 4 years ago | | |

Emulating a neuron != taking a comparable part in a computation. Probably the former is a lot more complex. For instance, an artificial net can take advantage of backpropagation in a separated training phase -- that's a lot of complexity that's factored out of the runtime phase.

alecst 4 years ago | | |

Last I heard (and I believe this could be wrong) my professor said that we basically understand how a single neuron works. That like basically if we do X input we get Y output, up to some accuracy. He used this to discuss the idea behind neural networks -- that each neuron is simple enough to model, all we need to worry about is the weights and the dynamics of the network as a whole.

How much of a simplification is that? And how much does the accuracy of such a model matter, in the grand scheme of things?

tudorw 4 years ago | | |

Won't somebody think of the exosomes and telocytes?

LuisMondragon 4 years ago | |

My issue with this kind of reasoning is the comparison and reference to the human brain. The potential and reach of AI transcends the brain. We never had to master the "mystery" of how birds fly to invent aviation. It was never necessary to compare the number of turbine revolutions of early airplanes to the number of an eagle's feathers. Maybe birds were an inspiration or a metaphor, but thankfully aviation has not been limited to the means of propulsion of the beautiful yet humble pigeon. The potential of aviation has taken us into space exploration and massive international travel. I don't know where AI will take us, but I don't think it will be constrained by this temporary organ called 'human brain'.

EvgeniyZh 4 years ago | |

Not all weights are born equal, different paradigms allow more parameters while being less parameter-efficient, e.g. https://openreview.net/forum?id=TXqemS7XEH

rajnathani 4 years ago | |

If retrieval based NLP [0] becomes a thing, then trillion plus parameters models will likely be less of a thing; as very likely, most of these tens to hundreds of billions of parameters are likely over-fitting (better word: memorized) on training data [text corpus] as seen in the case of GPT-3.

[0] https://ai.stanford.edu/blog/retrieval-based-NLP/

cs702 4 years ago | | |

Yes, self-attention mechanisms are dense associative memories, so it might be possible to replace them in many cases with simpler storage mechanisms. Still, I would count the required storage space as part of a model's parameter size -- e.g., a model consisting of 1 trillion values in RAM and 99 trillion values in storage consists of... 100 trillion values.

dustfinger 4 years ago | |

The switch transformer has already achieved a trillion parameters.

https://arxiv.org/abs/2101.03961

codeulike 4 years ago | |

Unless we have misunderstood neurons, and microtubules are the fundamental computational unit in which case we are out by an order of magnitude

eximius 4 years ago | | |

There was a result recently of modeling an organic neuron with 1000 digital neurons.

And even if that result was perfect modeling of the neuron, that assumes perfect and exhaustive data readings on the organic neuron, which is, frankly, unlikely. (Not that I know how to estimate how much it's missing, but I don't think we fully understand a single neuron yet.)

postalrat 4 years ago | |

CPU in kilohertz then megahertz then gigahertz then it stopped.

RAM in kilobytes then megabytes then gigabytes then it stopped.

sanxiyn 4 years ago | | |

Yes for CPU, no for RAM. You can buy a computer with terabytes of RAM just fine. It's just expensive.

manquer 4 years ago | | |

Those are material science and physical limitations.

Number of parameters in a neural network is not really limited that way, doing useful compute with it is a different matter

ZoomerCretin 4 years ago | |

China's Wu Dao 2.0 has 1.75 trillion parameters. https://towardsdatascience.com/gpt-3-scared-you-meet-wu-dao-...

fspeech 4 years ago | |

A 10 trillion parameter model was mentioned here: https://mobile.twitter.com/ethancaballero/status/14458268620...

ImprobableTruth 4 years ago | | |

That's MoE.

bane 4 years ago |

What's really interesting is that these models are using some non-trivial portion of all easily accessible human writing -- yet humans learn language really well with significantly less input data. What's missing in the field to replicate human performance in learning?

petters 4 years ago |

Training data has 0.339T tokens, less than the number of training parameters. A model like that could store all of the training text with 100B+ parameters left for computation.
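
As a rough check of that arithmetic, here is a minimal sketch. It assumes, purely for illustration, that one parameter stores exactly one token; the 530B figure comes from the model name and 0.339T from the comment above, and the variable names are hypothetical.

```python
# Naive capacity check: assumes one parameter can hold exactly one token,
# which is an illustrative simplification, not how the model stores text.
PARAMS = 530 * 10**9           # 530B parameters (from the model name)
TRAIN_TOKENS = 339 * 10**9     # 0.339T training tokens (figure quoted above)

leftover = PARAMS - TRAIN_TOKENS
print(f"parameters left after storing every token: {leftover / 1e9:.0f}B")
```

Under that assumption the remainder is 191B parameters, which is consistent with the "100B+" estimate.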

canjobear 4 years ago | |

Parameters have been sufficient to memorize the training data for a while now. The fact that neural networks still generalize in this setting is a big mystery that is under active investigation.

knuthsat 4 years ago | |

For some reason, a model having an insane number of weights while the training data is comparatively small is not an issue for modern NNs.

lucidrains 4 years ago | | |

https://arxiv.org/abs/2109.02355

visarga 4 years ago | |

But then you try to predict the next token on a completely unseen piece of the corpus and fail miserably if all you do is store the training data.

tgv 4 years ago | |

A single weight can’t encode an individual word, but the ratio looks close to overfitting to me too.

inciampati 4 years ago | | |

I've often wondered if a lighter reinforcement learning based model on top of a full text index might do as well or better than these putatively overfit language models. Curious if anyone knows of ongoing or recent work on this approach.

petters 4 years ago | | |

If 16-bit floating-point numbers are used, it can presumably encode all tokens. In theory. It would not be very easy to work with.
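
A quick way to sanity-check the 16-bit claim is to count bit patterns. The sketch below assumes a vocabulary of 50,257 entries (the GPT-2 BPE vocabulary size, used as a stand-in because the thread does not state this model's vocabulary) and the IEEE 754 half-precision layout of 1 sign bit, 5 exponent bits, and 10 mantissa bits.

```python
# Does a 16-bit value have enough distinct bit patterns to name every token?
# VOCAB_SIZE is an assumption: 50,257 is the GPT-2 BPE vocabulary size, used
# here as a stand-in; the thread does not state this model's vocabulary.
VOCAB_SIZE = 50_257
BITS = 16
MANTISSA_BITS = 10             # IEEE 754 half precision: 1 + 5 + 10 bits

patterns = 2 ** BITS                            # all 16-bit patterns
non_finite = 2 * 2 ** MANTISSA_BITS             # exponent all ones: inf/NaN
finite_patterns = patterns - non_finite         # patterns that are real numbers

fits = VOCAB_SIZE <= finite_patterns
print(f"{finite_patterns} finite patterns for {VOCAB_SIZE} tokens -> fits: {fits}")
```

Even after setting aside the infinity and NaN patterns, there are more finite 16-bit patterns than tokens in a vocabulary of that size. The count shows only that a mapping could exist; it says nothing about how a trained weight would end up holding one.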

thewarrior 4 years ago | |

Maybe that’s what it’s doing under the hood.

xnx 4 years ago |

This reminds me a little bit of the early 2000's where search engines would list the number of indexed pages on their homepage. For language models, does large = good? I'm guessing the quality of the corpus matters as much.

thunderbird120 4 years ago | |

>For language models, does large = good?

The short answer is yes, the long answer is it's complicated.

You could actually think of these models as a type of indexer because, at their heart, what they are doing is memorizing the training data and storing it in such a way that incomplete samples can be used as keys to extract complete samples. The magic happens because the models themselves (even the 100+ billion parameter ones) are nowhere near complex enough to actually store all of these possible key value pairs. Instead, the model has to compress its representation of the data which leads to generalization. Larger models can model more complexities which leads to better performance as long as your training dataset is sufficiently large and varied.

ewheeler 4 years ago | |

Maybe? The Scaling Hypothesis[1] suggests that greater capabilities of intelligence may emerge from scaling up 'scalable architectures' to large sizes. GPT-3 exhibits 'meta-learning' capabilities that GPT-2 did not (like learning how to sum numbers)--probably just because it's a 100x larger version of GPT-2.

[1] https://www.gwern.net/Scaling-hypothesis

bpiche 4 years ago | |

I'm not sure it moves the needle on NLU/classification tasks very much, compared to models with many fewer parameters. But it does seem to make the NLG better, which is what Microsoft seems obsessed with lately.

posharma 4 years ago |

This is great. Now, how do we inference these models economically? It appears there's some kind of competition to train larger and larger models, but the inferencing side of the story seems to be neglected?

dwohnitmok 4 years ago | |

Model inference is actually comparatively very cheap. If you have the resources to train a model, you most definitely have the resources to run it.

manquer 4 years ago | | |

Not necessarily; you train once, but you may run inference billions of times. The compute required could be beyond your resources.
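
A back-of-envelope sketch of this trade-off, using two common rules of thumb that are assumptions here rather than measurements: training costs roughly 6 * N * D FLOPs and generating one token costs roughly 2 * N FLOPs, where N is the parameter count and D the number of training tokens. The token count is the 0.339T figure quoted elsewhere in this thread. Attention cost, hardware utilization, and memory are ignored.

```python
# Back-of-envelope FLOP comparison. The 6*N*D and 2*N rules of thumb are
# approximations (assumed here), ignoring attention cost and utilization.
N = 530e9        # parameters
D = 0.339e12     # training tokens (figure quoted in this thread)

train_flops = 6 * N * D             # one full training run
infer_flops_per_token = 2 * N       # one forward pass per generated token

# Number of inference tokens at which serving has cost as much as training.
break_even_tokens = train_flops / infer_flops_per_token   # equals 3 * D

print(f"training:   {train_flops:.2e} FLOPs")
print(f"per token:  {infer_flops_per_token:.2e} FLOPs")
print(f"break-even: {break_even_tokens:.2e} tokens")
```

Under these assumptions both comments hold: a single inference is tiny next to the training run, yet cumulative serving matches the training cost after about three times the training token count, which a heavily used service could plausibly reach.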

shock-value 4 years ago | | |

Does that hold as the workload scales up? E.g. could this or similar models be used as part of a general-purpose search engine whereby (at least) one inference is completed per unique search? Aside from computation, I know these models consume an intense amount of memory -- would that scale horizontally easily / economically? Would it need to?

buffington 4 years ago | |

When you say "inference", do you mean "interface", or is "inference" an ML term I'm not familiar with?

igorkraw 4 years ago | | |

It's an ML term; inference basically means using the probability model you learned to draw "inferences" about a piece of data. In this context, it means giving the language model some context and using some method (either arg max sampling or something more sophisticated like beam search) to do what amounts to statistical autocompletion on it. As you might imagine, doing this with 530 GB of data at speed is quite energy intensive, even though there are things you can do to compress the model (distillation, pruning, compression/discretization) and specialised inference hardware.

Technically there is some very specific meaning to inference vs. prediction, but it's been heavily overloaded with meaning by now
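
To make the "arg max" case concrete, here is a toy sketch of greedy decoding. The `NEXT_TOKEN_PROBS` table and the `greedy_decode` function are hypothetical stand-ins: a real language model conditions on the whole context, while this table looks only at the last token.

```python
# Toy greedy ("arg max") decoding. NEXT_TOKEN_PROBS is a hypothetical
# hand-written table standing in for a trained language model.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def greedy_decode(prompt, max_new_tokens=10):
    """Repeatedly append the most probable next token (prompt must be non-empty)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if probs is None:            # unknown context: nothing to predict
            break
        best = max(probs, key=probs.get)
        if best == "<eos>":          # end-of-sequence marker stops decoding
            break
        tokens.append(best)
    return tokens

print(greedy_decode(["the"]))
```

Beam search would keep several candidate sequences alive at each step instead of committing to the single best token, at the price of more forward passes per generated token.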

sanity31415 4 years ago | | |

First you train a model then you use it, "inference" is a fancy word for using the model.

miket 4 years ago |

https://en.wikipedia.org/wiki/Wu_Dao

bobm_kite9 4 years ago |

I guess I'm interested to see if this performs qualitatively better than GPT-3, given how many more parameters it has.

However, I think this is really a dead-end: throwing more hardware at this is just going to generate better-sounding nonsense. Yes, we are learning the "model" of the English language - which words go with which others, but successively larger transformer models don't really expose much more about the nature of intelligent conversation.

I think we need a better algorithm now.

jamesbriggs 4 years ago | |

Also agree with this, it's almost like a marketing ploy - especially from OpenAI. They produce awesome stuff but things with GPT can get silly sometimes, like when they wouldn't release the larger versions because 'they were too powerful', and in the end you ask it how many eyes my foot has and it says seven...

It produces interesting results, but doesn't really progress the field - although who knows, maybe skynet is actually a 100T parameter transformer

simonh 4 years ago | |

Listening to conversations and learning how they flow is only one aspect of language learning. What these things are missing is the interactive part. Next generation systems need to be able to form hypotheses about what appropriate responses should be, try out various responses and then see what the results are. They can only learn so much from passive consumption of training sets, so I agree this approach is going to hit a wall of diminishing returns.

putlake 4 years ago |

China's WuDao model had 1.75T parameters and Google's Switch transformer had 1.6T. How is this the world's largest then?

captn3m0 4 years ago |

Interesting that books3 and The Pile are among the largest corpora used for training - both with copyright concerns.

robbedpeter 4 years ago | |

Do you want to be the reason we can't have nice things? Please don't post things like this.

captn3m0 4 years ago | | |

They’re using and citing it, better than whatever GPT-3 did.

savant_penguin 4 years ago |

Really cool!

I'd love to see a table comparing the results against the other gigantic models (I know I could Google the other results and merge them together, but no thanks)

rustc 4 years ago |

Has there been any update on the legality of using this kind of model? Is it ok to just crawl the web, take any content you want, train a model and sell access to the model like OpenAI/GPT-3/GitHub Copilot?

wyldfire 4 years ago | |

For the most part, everything that's not barred by law is "legal." Does this use constitute copyright infringement (if it were trained on copyrighted material)? IMO no, but it depends very much on the use of the model. Copilot is especially interesting because instead of being used for simple inference the model is being used to author new works that might aspire to also be copyrighted. Are those new works derivative works? Perhaps. We consider art and science produced by humans to be inspired in part by that which they've been exposed to before. If the model hasn't been overfitted, it should generalize its 'knowledge' sufficiently that it's 'similar' to our intelligence. Humans can commit copyright infringement when they recall and author content so specifically as to be a derived work.

In any case: my opinion matters for naught. The only 'update' you'd get that matters is from a court producing a ruling. Legal journals might chime in but their opinion isn't binding. Theoretically there could be legislation to clarify but that's probably a really, really, really long way off.

Certainly some of the training looks to be content that's not copyrighted or no longer copyrighted, btw.

sonic-boom 4 years ago |

Any idea if they’ll release an API similar to GPT-3? It’s great that larger and larger models are trained but without enabling access to the trained models developers are left out from the progress…

jowday 4 years ago | |

I hope they don’t release an API the way they released an API for GPT-3.

sanxiyn 4 years ago | | |

Why? What do you not like about the GPT-3 API?

RhysU 4 years ago |

So, uh, what do the seed-to-seed variance studies look like on a network of this size? Surely someone trained 100 to see the distribution. ducks

trash3 4 years ago |

How has the previous largest model, GPT-3, generated value? How much better is this model at those tasks?

sanxiyn 4 years ago | |

GPT-3 powers GitHub Copilot, so it generated some value.

bpiche 4 years ago |

Wonder how much compute it would cost to train this thing, if you weren't Nvidia..

macrolime 4 years ago |

Will anyone outside of Nvidia be able to access it? GPT-3 at least has an API.

saurkt 4 years ago | |

(Team member of this project) Just a clarification, both Microsoft and Nvidia have ownership of this model. Here is the Microsoft version of same announcement.

https://www.microsoft.com/en-us/research/blog/using-deepspee...

thenightcrawler 4 years ago | |

hoping so!

sonic-boom 4 years ago | | |

Same!

canjobear 4 years ago |

What's the perplexity?

moochi 4 years ago |

all these models are over-hyped. we are nowhere close to AGI until we can come up with a reasonable definition for consciousness

ultra_nick 4 years ago | |

It's possible consciousness isn't a real thing. Humans might just be big neural networks that predict the actions most likely to result in survival.