Megatron-Turing NLG 530B, the World’s Largest Generative Language Model (developer.nvidia.com)
Trillion-parameter models are surely within reach in the near term -- and that's only within two orders of magnitude of the number of synapses in the human brain, which is in the hundreds of trillions, give or take. To paraphrase the popular saying, a trillion here, a trillion there, and pretty soon you're talking really big numbers.
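For anyone who wants to check the "two orders of magnitude" arithmetic, here is a quick back-of-the-envelope sketch. The synapse count is an assumption: I take 1e14 as the low end of "hundreds of trillions", and the parameter counts are just the round numbers from this thread.

```python
import math

SYNAPSES = 1e14          # assumption: low end of "hundreds of trillions"
ONE_TRILLION = 1e12      # a hypothetical trillion-parameter model
MT_NLG = 5.3e11          # the 530B model from the headline


def orders_of_magnitude_gap(parameters: float, synapses: float = SYNAPSES) -> float:
    """Return log10(synapses / parameters), i.e. the gap in powers of ten."""
    return math.log10(synapses / parameters)


print(f"1T params vs synapses:   {orders_of_magnitude_gap(ONE_TRILLION):.2f} orders")
print(f"530B params vs synapses: {orders_of_magnitude_gap(MT_NLG):.2f} orders")
```

With a higher synapse estimate the gap grows accordingly, so treat the result as a rough lower bound.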
I know the figures are not comparable apples-to-apples, but still, I find myself in awe looking at how far we've come in just the last few years, to the point that we're realistically contemplating the possibility of seeing dense neural networks with hundreds of trillions of parameters used for real-world applications in our lifetime.
We sure live in interesting times.
Suppose a friend comes over and says "I went for dinner at a restaurant. Oh my god the portions were sooo big!". Wouldn't you want to know more information about the food and the restaurant, before you decided whether you're interested in it?
I appreciate that "big" is associated in people's minds with "strong", but most of the work on making language models bigger and bigger goes against the normal trend in computer science [1], and in neural network research in general, where the trend is to constantly try to reduce the size of models and improve their data efficiency.
What's worse, the trend to supersize language models is never justified, either theoretically (ha ha) or empirically, in the relevant literature - and when rival teams run the obvious experiments, the evidence is that size is not required to achieve good performance. For example:
It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
https://aclanthology.org/2021.naacl-main.185/
____________
[1] Imagine someone bragging that their mergesort implementation has a million LOC! People brag about implementations in few lines of code, not many.
What? That's absurd. Large language models are motivated by empirical scaling laws. That is actually better justified than most other ML research.
Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361
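To make the scaling-law point concrete, here is a minimal sketch of the power-law form the paper fits for loss as a function of model size, L(N) = (N_c / N) ** alpha_N. The constants are the approximate fitted values as I recall them from the paper (alpha_N around 0.076, N_c around 8.8e13 non-embedding parameters), so treat them as assumptions and check the paper before relying on them. The formula also assumes data and compute are not the bottleneck.

```python
# Power-law fit for test loss vs. model size, in the form used by
# "Scaling Laws for Neural Language Models": L(N) = (N_c / N) ** alpha_N.
ALPHA_N = 0.076   # assumption: approximate exponent reported in the paper
N_C = 8.8e13      # assumption: approximate constant (non-embedding parameters)


def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N


# Illustrative sizes only; the fit was made on much smaller models.
for n in (1.5e9, 1.75e11, 5.3e11):
    print(f"{n:.2e} params -> predicted loss {predicted_loss(n):.3f}")
```

The point of the exponent is that every 10x increase in parameters multiplies the predicted loss by the same factor, 10 ** -0.076, which is a small but steady improvement.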
In my mind, this is now called Pakled reasoning, from the scene in Star Trek: Lower Decks.
Pakled rebel turned leader:
"I am now Pakled leader. Behold my giant helmet!"
Other Pakled:
"He is strong!"
https://www.youtube.com/watch?v=lv1uhAa_M_U&t=193s
It's a good question. A while ago, Rich Sutton wrote a good answer to it: http://incompleteideas.net/IncIdeas/BitterLesson.html -- I recommend reading the whole essay. Quoting him (emphasis mine):
> The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.
> We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
A key related question -- to which no one has the answer today -- is whether we must scale computation to match or exceed that of the human brain to be able to replicate or surpass its cognitive abilities. (Note that this question is independent of whether doing so would require future theoretical breakthroughs -- another question to which no one knows the answer today.)
PS. See also sanxiyn's response: https://news.ycombinator.com/item?id=28838745
Given how extremely bad we are at computing, there is hope (for AI) that the neurons or their circuits are not _that_ powerful after all.
How much of a simplification is that? And how much does the accuracy of such a model matter, in the grand scheme of things?
And even if that result was perfect modeling of the neuron, that assumes perfect and exhaustive data readings on the organic neuron, which is, frankly, unlikely. (Not that I know how to estimate how much it's missing, but I don't think we fully understand a single neuron yet.)
RAM in kilobytes then megabytes then gigabytes then it stopped.
The number of parameters in a neural network is not really limited that way; doing useful compute with them is a different matter.
The short answer is yes, the long answer is it's complicated.
You could actually think of these models as a type of indexer because, at their heart, what they are doing is memorizing the training data and storing it in such a way that incomplete samples can be used as keys to extract complete samples. The magic happens because the models themselves (even the 100+ billion parameter ones) are nowhere near complex enough to actually store all of these possible key value pairs. Instead, the model has to compress its representation of the data which leads to generalization. Larger models can model more complexities which leads to better performance as long as your training dataset is sufficiently large and varied.
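A toy sketch of the "incomplete sample as a key" idea, in case it helps. This is an explicit lookup table, which is exactly what a real model is too small to be, so it only illustrates the retrieval interface, not the compression or generalization. The function name, the wildcard convention, and the stored sentences are all made up for illustration.

```python
def complete(partial: str, memory: list[str], wildcard: str = "_") -> str:
    """Return the stored sample that agrees with `partial` at the most known positions."""

    def score(sample: str) -> int:
        # Count positions where the partial key is known and matches the sample.
        return sum(1 for p, s in zip(partial, sample) if p != wildcard and p == s)

    return max(memory, key=score)


# Hypothetical "training data" held verbatim in memory.
memory = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "a stitch in time saves nine",
]

# The key is an incomplete sample; unknown characters are marked with "_".
print(complete("the d__ ate my ________", memory))
```

A trained model has to do something like this without room to store every sample, which is where the compression argument above comes in.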
Technically there is some very specific meaning to inference vs. prediction, but it's been heavily overloaded with meaning by now
However, I think this is really a dead-end: throwing more hardware at this is just going to generate better-sounding nonsense. Yes, we are learning the "model" of the English language - which words go with which others, but successively larger transformer models don't really expose much more about the nature of intelligent conversation.
I think we need a better algorithm now.
It's producing interesting results, but it doesn't really progress the field - although who knows, maybe skynet is actually a 100T parameter transformer
I'd love to see a table comparing the results against the other gigantic models (I know I could Google the other results and merge them together, but no thanks)
In any case: my opinion matters for naught. The only 'update' you'd get that matters is from a court producing a ruling. Legal journals might chime in but their opinion isn't binding. Theoretically there could be legislation to clarify but that's probably a really, really, really long way off.
Certainly some of the training looks to be content that's not copyrighted or no longer copyrighted, btw.
https://www.microsoft.com/en-us/research/blog/using-deepspee...
So they are missing 2 years worth of visual, auditory, tactile and other modalities (grounding), having direct access to change their environment (embodiment) and being part of our society or an AI society (social).
This is the paper I love to link in response to this sort of objection.
The OP's criticism is valid. Being forced to learn everything end-to-end, from scratch, is a severe limitation.
This is not to say that language models are efficient, of course. That's not even remotely true. But we seem to under-estimate how much time and resources we need to learn something.
Now compare to single GPU prices...
Edit:
>> A key related question -- to which no one has the answer today -- is whether we must scale computation to match or exceed that of the human brain to be able to replicate or surpass its cognitive abilities.
What "computation" is that? Are you talking about scaling up neural networks, which is more in the context of the conversation, but requires some very big assumptions about (artificial) neural networks? Do you mean a different kind of computation?
(Note: my comment, plus the above edit, is a series of questions and I recognise that comments like that can come across as standoffish. This is not my intention, so please accept the questions above as having been asked in the most neutral tone possible and in the interest of promoting conversation, rather than confrontation.)
Yes. But note that under the rubric of "deep neural networks" or "deep learning," I would include a lot of things, including combinations of methods like "deep reinforcement learning," learning by self-play via gradual evolution of surviving models, models that use "dense associative memories," of which transformers are only one special case, and future deep learning methods that have not yet been discovered.
And yes, some very big assumptions are required!
FWIW, your comments did not come across as standoffish to me :-)
[1] https://blog.google/products/search/search-language-understa...
They're neither. When performing very human-adjacent tasks, it will certainly put the ML algorithm at a disadvantage compared to us.
But for non-human adjacent tasks, say interpreting what a sequence of amino acids actually means, we can expect the computer to absolutely crush us because our stupid human heuristics take us absolutely nowhere, cause us to see patterns that aren't there, etc. etc.
Regardless, this is irrelevant to the original point that I was making, which is that comparing the performance of DL on human adjacent tasks to the amount of time it takes a human to learn the same task is misguided because you are ignoring the million-year long optimization process to get there.
Such research is the area of computational neuroscience - one thing that such people do is try to model parts of the brain (or just a single neuron) with computers.
A neuron (= nerve cell in the brain) is a very complex beast. In rough terms they work like this: they collect signals (electrical impulses) via their small appendages called dendrites. When the sum of the signals reaches a certain threshold, a large electrical impulse is generated at the cell body and travels through its "output" appendage (called the axon), which is connected to another neuron's cell body or to one of its dendrites.
Neurons display a dazzling variety in all these parameters:
- In morphology, e.g. they can look like a pine tree http://www.scholarpedia.org/article/Pyramidal_neuron (I really recommend scholarpedia, also this article has a nice animation on how electrical impulses propagate) or like a sea urchin.
- It really matters where the cell gets its impulse from: a neuron stimulated near its cell body will be much more sensitive to the input than one stimulated far away.
- Their response characteristics are wildly varied too. Some give off one large impulse, some a quick burst of impulses. Some prevent others from giving out impulses when stimulated (inhibitory neurons).
- This whole mess can be modulated with chemical compounds that are released by the body -- some make some neurons more sensitive, some less.
- Also, every year we still discover some new mechanism that modulates how they function.
The issue is that this results in such a complex system that a modern PC can't even simulate one detailed neuron model in real time (these tools are open source, try them out! for example https://neuron.yale.edu/ ). Now we know that we're simulating things that likely do not matter (e.g. we don't need a neuron model that consists of 10,000+ segments), but we do not know which parts we need to remove to have a faithful simulation. Also, we might simply simulate some parts wrong because our knowledge of the subject is not enough.
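For contrast with those detailed multi-segment models, the "sum the inputs, fire when a threshold is crossed" picture from the rough description above can be sketched as a leaky integrate-and-fire neuron. This is a minimal sketch, not how NEURON or any other simulator does it: the units are arbitrary, the parameter values are made up, and it ignores everything in the list above (morphology, input location, bursting, inhibition, chemical modulation).

```python
def simulate_lif(input_current, dt=1.0, tau=10.0, threshold=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron with forward-Euler steps.

    input_current: sequence of input values, one per time step (arbitrary units).
    Returns the list of step indices at which the neuron fired.
    """
    v = v_reset
    spikes = []
    for step, current in enumerate(input_current):
        # The membrane value leaks towards zero and is pushed up by the input.
        v += dt * (-v / tau + current)
        if v >= threshold:
            spikes.append(step)   # threshold crossed: emit an impulse
            v = v_reset           # and start integrating again from rest
    return spikes


strong = simulate_lif([0.2] * 20)   # steady input that can reach threshold
weak = simulate_lif([0.05] * 20)    # steady input that leaks away too fast
print("strong input spikes at steps:", strong)
print("weak input spikes at steps:", weak)
```

A model like this runs many neurons in real time on a laptop, which is the trade-off being described: it is cheap precisely because it throws away the details, and nobody knows yet which of those details matter.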
But on the upside, we've achieved some great things already. For example, we know how our brain calculates, from our head and eye position, the orientation of the things we're looking at.
A lot. Parallel optimization is an art form. These models are trained on static datasets, they can't intervene in the environment to infer causal relations, so they need legs and hands.
There is a whole lot of new knowledge on how live neurons and networks of neurons work that has been collected in the last 75 years in the neuroscience domain, but it’s mostly ignored by computer scientists.