DeepMind’s new AI with a memory outperforms algorithms 25 times its size

(singularityhub.com)

324 points by darkscape 4 years ago | 132 comments

whazor 4 years ago |

Very interesting. GPT-J is an open-source, free alternative to GPT-3 and requires at least 12.1GB of memory to run the model (which is reduced from the original 48GB ram). But if the model stores some kind of index and does internet (or hard-drive) searches instead, then it could scale much further, as there is a limit on how much memory you can use in production.

endymi0n 4 years ago | |

Just doing some napkin math, the whole GPT-J corpus was around 500 billion tokens, which at 4 bytes per token would be roundabout 2 Terabytes. That, parked on a fast NVMe SSD, will give you roundabout 1MM random lookups per second. Even with some transfers in between, this should be more than enough to not just perform in equal time, but probably less, as well as cost you less than the GPU you need for the reduced size model.

Exciting times.
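For anyone who wants to redo the napkin math, here is a minimal sketch. The 4 bytes per token and the 1MM lookups per second are the rough figures from the comment, not measurements, and the function names are made up for illustration.

```python
# Back-of-envelope check. Every figure is an estimate taken from the
# comment above or a stated assumption, not a measured value.
TOKENS = 500e9          # corpus size quoted in the comment
BYTES_PER_TOKEN = 4     # assumed average for English text
LOOKUPS_PER_SEC = 1e6   # random-read rate quoted for a fast NVMe SSD


def corpus_bytes(tokens=TOKENS, bytes_per_token=BYTES_PER_TOKEN):
    """Raw text size in bytes for a corpus of `tokens` tokens."""
    return tokens * bytes_per_token


def lookup_seconds(n_lookups, lookups_per_sec=LOOKUPS_PER_SEC):
    """Time spent on `n_lookups` random reads at the assumed rate."""
    return n_lookups / lookups_per_sec


corpus_tb = corpus_bytes() / 1e12  # decimal terabytes
print(f"corpus size: {corpus_tb:.1f} TB")
print(f"1,000 lookups take about {lookup_seconds(1000) * 1000:.1f} ms")
```

Swapping in a different bytes-per-token figure or a different drive is a one-line change, which is the main reason to write it down at all.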

GistNoesis 4 years ago | | |

The real problem with (NVMe) SSDs is that they have a limited number of write cycles (a max TB written).

If you don't update your database and indices they are great. But that's something really tempting to do when you do some machine learning (especially if you know that people with deeper pockets will do so).

Typically you will have a neural network, you run it on your dataset, it produces a new dataset of embeddings, you index them, and you use this index to train a new neural network, and you repeat the loop, hopefully improving results along the way.

An NVMe SSD can write at 6GB/s but can only endure ~800TB of writes; that's about 37 hours of lifetime at max speed.
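The 37-hour figure follows directly from the two numbers quoted: total endurance divided by write speed. A quick sketch, assuming decimal units (1TB = 1e12 bytes) and a drive that sustains its peak write speed the whole time:

```python
# Worst-case SSD lifetime at full write speed.
# Both constants are the round numbers from the comment, in decimal units.
WRITE_SPEED_BYTES_PER_SEC = 6e9   # 6GB/s sustained writes (assumed peak)
ENDURANCE_BYTES = 800e12          # ~800TB total bytes written


def lifetime_hours(endurance_bytes=ENDURANCE_BYTES,
                   write_speed=WRITE_SPEED_BYTES_PER_SEC):
    """Hours until the endurance budget is used up at constant write speed."""
    return endurance_bytes / write_speed / 3600


hours = lifetime_hours()
print(f"about {hours:.0f} hours at max speed")
```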

iggldiggl 4 years ago | | |

> Just doing some napkin math, the whole GPT-J corpus was around 500 billion tokens, which at 4 bytes per token would be roundabout 2 Terabytes.

"Only" 825 GB actually: https://pile.eleuther.ai/

A not-insignificant fraction of that is definitely copyrighted material, though, which raises some interesting questions when switching to a model of distributing "a smaller trained model plus the original raw training data" (though it seems that the team behind GPT-J are clearly happy to distribute their full set of data anyway, and seem to be far enough under the radar to not attract the wrong sort of attention, at least for now).

sbierwagen 4 years ago | |

>48GB ram

48GB VRAM? 48+ gigabytes of system ram is cheap, 48 gigabytes of ram on a GPU is still painfully expensive.

moffkalast 4 years ago | | |

Yes, in this GPU market that's essentially a new car's worth of cash.

JudasGoat 4 years ago | | |

The AMD APUs would be interesting, although underpowered. They give you the option of setting "VRAM" size to almost any percentage of system memory.

schleck8 4 years ago | | |

GPUs are the slowing factor in general if I'm not mistaken when it comes to Deep Learning progress.

charcircuit 4 years ago | | |

Yes, vram

spywaregorilla 4 years ago | |

Isn't this what watson used to do?

neom 4 years ago |

A neural net with access to Wikipedia is faster than a neural net that contains Wikipedia? Seems odd to call it AI with a memory though... unless I'm misunderstanding. It's more like AI with a decent memory and an understanding of how to use an encyclopedia.

robbedpeter 4 years ago | |

Yeah, memory implies persisted state in the model, this is static lookups separate from the transformer.

Still superb, though, there's no reason you can't use other gofai tools vs a static database, to trigger expert systems or formalized reasoning.

visarga 4 years ago | | |

It's not gofai. It's locality sensitive hashing published in 2008.
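For readers who haven't met locality sensitive hashing, here is a minimal sketch of one common family, random-hyperplane hashing. It is only an illustration of the general idea and makes no claim to be the scheme the paper uses; the function names, the 8-bit signature and the toy 3-dimensional vectors are all assumptions.

```python
import random


def make_planes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes
    )


def build_index(vectors, planes):
    """Group item keys into buckets keyed by their signature."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(key)
    return buckets


def candidates(query, buckets, planes):
    """Return the keys that landed in the same bucket as the query."""
    return buckets.get(signature(query, planes), [])


planes = make_planes(n_bits=8, dim=3)
vectors = {"a": [1.0, 2.0, 3.0], "b": [-1.0, -2.0, -3.0]}
index = build_index(vectors, planes)
print(candidates([2.0, 4.0, 6.0], index, planes))
```

Vectors pointing in similar directions tend to share a bucket, so a query only needs to be compared against a small candidate set. Near neighbours can still fall on opposite sides of a plane, which is why practical systems use several hash tables and re-rank the candidates.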

neom 4 years ago | | |

I got into a very long debate with an OpenAI person 4/5 years ago about this + adversarial learning + access to a quantum computer (think just straight up world class abacus) being close to the primitives required for more generalized AI. They didn't agree with me, but that's ok! :)

LoveMortuus 4 years ago | |

I wonder what we can learn from AI models about how humans work.

Like, could we assume that for humans it's also faster to search for information on Wikipedia or would it be faster to recall from memory of already read Wikipedia? Although with humans stored information decay is present. (In a way a human form of garbage collection :P).

meiji163 4 years ago |

I'd be interested to see if these models are robust against algorithms like TextFooler [0]. I'm skeptical this trend of 10x'ing the parameters will solve the "clever hans" problem.

[0]: https://github.com/jind11/TextFooler

amitport 4 years ago |

Dup https://news.ycombinator.com/item?id=29486607

(This is a different blogpost, but does not seem to add over the original)

Edit: following derac's comment see https://news.ycombinator.com/item?id=29646112 for RETRO

derac 4 years ago | |

This article is about RETRO [0], not Gopher.

[0] https://deepmind.com/research/publications/2021/improving-la...

davefol 4 years ago |

Seems odd to claim 25x reduction in size when the algo involves looking into a database of a trillion chunks of text.

visarga 4 years ago | |

The "algo" here refers to the neural net itself. The text index is considered an easy problem as you can do lookups in logarithmic time.

ehsankia 4 years ago | |

The word "Algo" here is definitely awkward. The point is though that what matters most here is the number of parameters, as those correlate quite closely with training and inference time. Storage space is pretty trivial, but TPU cycles are less so.

davefol 4 years ago | | |

Thanks for this. Rereading + your comment and I think I have a better understanding of why this is progress.

AJRF 4 years ago |

I've not kept track of where large transformers like this have gotten to, GPT3 and the like - has GPT3 made any real difference to the world? Are people using it? Has it vastly improved any software?

tveita 4 years ago | |

It's a safe bet that Google is using transformers at scale for search and translations - the full extent isn't public but they release a fair amount of research papers, e.g. the current article, or https://ai.googleblog.com/2020/06/recent-advances-in-google-...

Github Copilot is definitely GPT-3-based and is seeing real-world use https://copilot.github.com

Transformers are state of the art for many tasks so they are likely to be used for "intelligent" processing of text or speech data, but due to practical limitations you are probably interacting with them mostly through web services.

muzani 4 years ago | |

I don't know about world changing but it's saved me hundreds of hours. I use it to help read academic papers, put formatting on things like markdown and subtitles, and creative writing. A lot of the things that take it 15 seconds to do take me 2 minutes and drain me mentally for about 15 mins.

If anything, it's being used in force for social media marketing, where you're trying to say "buy this thing" in different ways every day.

regularfry 4 years ago | | |

Forgive the ignorance, but how? What tools are you using on top of GPT3 to do those things?

wiz21c 4 years ago | | |

Seconding someone else's comment: what is your workflow for those tasks? How does it help you to read academic papers? Or to put formatting on markdown?

ggm 4 years ago |

If we point it at the horrendously bad scots wiki (some kid in the US decided he'd translate Wikipedia into what he thought was lowland scots/Doric.. it's a disaster) we might get entertainingly bad outcomes.

buro9 4 years ago | |

Oh wow, that's a fun rabbit hole: https://www.theguardian.com/uk-news/2020/aug/26/shock-an-aw-...

bawolff 4 years ago | |

Note, the stuff written by said kid has long since been deleted. However, I have no idea what the quality of the rest of scowiki is.

ggm 4 years ago | | |

It's not awful, but I feel it's still pretty meh. I say this as a person raised in Edinburgh in the sixties and seventies. How bad? Well.. in their backend meta pages they link to the DSL (Dictionary of the Scots Language/Dictionars o' Scots Leid [0]) which says this:

> Written Scots: In the written mode, Scots spelling remains variable. Attempts to make it more consistent, notably the Scots Style Sheet produced by the Makars’ Club in 1947 or the Recommendations for Writers in Scots published by the Scots Language Society in 1985, have had at best only limited success, competing with other systems that have been developed to represent more closely localized varieties of spoken Scots.

When your reference text says the language isn't yet well captured in a single print, you better believe the wiki page is a hot mess.

[0] https://dsl.ac.uk/about-scots/what-is-scots/

LudwigNagasena 4 years ago | | |

Well, it is pretty hard to make something in a language when it is a dialect continuum and not a standardized variety that is forced onto the whole population through the education system and media.

jari_mustonen 4 years ago |

Could someone explain the article to a layman engineer?

visarga 4 years ago | |

It's language modelling with a search engine in the loop.

Instead of training GPT-3 with 175B weights, you train a 25x smaller model and allow it to retrieve useful snippets from a large text index as additional information.

This solves the problem of very large models and the problem of updating an already trained model, as you can swap the text corpus with a newer one. The model learns mostly syntax, burning less trivia in its weights than a regular LM as it can simply copy the relevant information from the index.

This development was bound to happen as large LMs are expensive to use and it was an obvious idea. We've had these semantic search text indices for a few years already[1], they just weren't combined with text generation.

[1] https://github.com/spotify/annoy
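The idea above can be sketched in a few lines. This is a toy under loud assumptions: bag-of-words counts stand in for a learned embedding, a linear scan stands in for an approximate nearest-neighbour index such as the one linked in [1], and the retrieved snippet is simply prepended to the prompt, which is a simplification (as far as I know RETRO feeds retrieved chunks in through cross-attention rather than the prompt). All names and the three-line corpus are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=1):
    """Return the k corpus snippets most similar to the query (linear scan)."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]


def build_prompt(query, corpus, k=1):
    """Prepend retrieved snippets to the query before handing it to a small LM."""
    return "\n".join(retrieve(query, corpus, k) + [query])


corpus = [
    "Edinburgh is the capital of Scotland",
    "NVMe drives have limited write endurance",
    "Transformers use attention over tokens",
]
print(build_prompt("what is the capital of Scotland", corpus))
```

Updating what the system "knows" is then a matter of replacing `corpus`; nothing about the model itself has to be retrained, which is the swap-the-text-corpus point made above.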

alkonaut 4 years ago | | |

So the memory doesn't solve the context problem of e.g. "conversation context"? I.e. the storage isn't modified while the model is used? If I make an app that makes conversation using such a model, then the storage isn't modified to insert knowledge about what the early parts of the conversation were about, and it's only bringing a database of fixed information into the conversation? (I have a friend who is just like that).

jamesblonde 4 years ago | | |

Yes, the key technology here is a scalable embedding store. The leading players here are the indexes - Faiss and ScaNN. The open source platforms are OpenSearch, Elasticsearch, Featureform, Milvus. Then there are SaaS products like Pinecone.

Blikkentrekker 4 years ago |

> Gebru, a widely respected leader in AI ethics research, is known for coauthoring a groundbreaking paper that showed facial recognition to be less accurate at identifying women and people of color, which means its use can end up discriminating against them.

Surely this is a function of location? I understand the U.S.-English term “person of color” to be convoluted language for “not white”. One simple thing I notice is that if I search for, say, “child” on Google Image Search, the images indeed tend to look like what one would expect from the average inhabitant of an English-speaking nation; when I search “子供”, I indeed mostly see what I would expect from Japan. Similarly, if I search for “house”, what I find tends to look like a house most likely situated in the Netherlands; with “บ้าน”, it does resemble more so stereotypical Thai architecture.

I would assume that AIs made in, say, Japan would yield different results.

baalimago 4 years ago |

Okay. Now make the small AI with memory 25 times bigger!

rapjr9 4 years ago |

This seems like a very interesting approach to creating an AI that can continuously learn new things by just updating its database. Maybe a first step towards a general purpose AI? It would be interesting to create a personal assistant based on this whose database was fed the entire digital stream generated by a person's life. How would you protect such an AI from misuse? Add another AI with a database of information on ethics that acts as a gatekeeper? Could you somehow keep the gatekeeper from being turned off, perhaps by using cryptography in some fashion for access control?

lebuffon 4 years ago |

Could we say that they are re-inventing the human mind architecture by enhancing "fluid intelligence" with "crystallized intelligence"?

As humans age we apparently lose the former but compensate with the latter as best we can.

mik09 4 years ago |

oftentimes one can shrink a model down dramatically once one has a bigger, more robust model. but shrinking a huge model is still a great achievement.

charcircuit 4 years ago |

So where can we download these models?

rocgf 4 years ago | |

Given that this is DeepMind and not some more open AI organization, I assume you cannot.

kkjjkgjjgg 4 years ago |

Sounds as if they stored all the correct answers in a database and call it "better". How do they even evaluate these models? Like they already have a billion preprepared correct answers in the database. How do they come up with new questions for the evaluation?

charcircuit 4 years ago | |

It's the equivalent of taking a test where you can use the internet. Sure, you know the information needed to answer the question exists, but it can be difficult to extract the answer and word it into an English sentence.

blovescoffee 4 years ago | |

Instead of storing the correct answers in an encoded/embedded form in the weights of the neural net (certain neurons very loosely corresponding to certain "answers") the correct answers are stored elsewhere. That way we can scale down the model to the necessary "thinking" parts and we don't need to use excess neurons for the "memory" part. Kind of handwavey but hopefully that explains the general idea.

kkjjkgjjgg 4 years ago | | |

You mean otherwise the whole words would be encoded in the net, and now you only need to encode the index in the database?

tasty_freeze 4 years ago | |

> all the correct answers

That is clearly not possible, so it can't be what they are doing.

Rather than diffusely encoding that knowledge in a massive number of self-organized layers of weights, it is explicitly encoded. The remaining network can "focus" on mapping input to retrieve the relevant information stored in that database, and extracting/interpolating/extrapolating that information based on the current context to generate useful output.