GPT-4o's Memory Breakthrough – Needle in a Needlestack

GPT-4o's Memory Breakthrough – Needle in a Needlestack (nian.llmonpy.ai)

478 points by parrt 2 years ago | 239 comments

This is based on a limericks dataset published in 2021. https://zenodo.org/records/5722527

I think it very likely that gpt-4o was trained on this. I mean, why would you not? Innnput, innnput, Johnny five need more tokens.

I wonder why the NIAN team don't generate their limericks using different models, and check to make sure they're not in the dataset? Then you'd know the models couldn't possibly be trained on them.

sftombu 2 years ago | |

I tested the LLMs to make sure they could not answer the questions unless the limerick was given to them. Other than 4o, they do very badly on this benchmark, so I don't think the test is invalidated by their training.

cma 2 years ago | | |

Why wouldn't it still be invalidated by it if it was indeed trained on it? The others may do worse and may or may not have been trained on it, but them failing on it doesn't in itself imply 4o can do this well without the task being present in the corpus.

dontupvoteme 2 years ago | | |

It would be interesting to know how it acts if you ask it about one that isn't present, or even lie to it (e.g. take a limerick that is present but change some words and ask it to complete it)

Maybe some models hallucinate or even ignore your mistake vs others correcting it (depending on the context ignoring or calling out the error might be the more 'correct' approach)

Using limericks is a very nifty idea!

neverokay 2 years ago | |

Why not just generate complete random stuff and ask it to find stuff in that?

Kostchei 2 years ago | | |

We have run that test:
- generate random strings (not by an LLM) as names of values
- ask the LLM to do math (algebra) using those strings

It tests logic and is 100% not in the data set. GPT2 was like 50% accurate; now we're up around 90%.
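A rough sketch of what such a test generator could look like, assuming simple two-variable arithmetic and exact-match scoring. The function names (`random_name`, `make_algebra_item`, `score`) and the question wording are made up for illustration; this is not the actual harness described above.

```python
import random
import string


def random_name(rng, length=8):
    """Return a random lowercase identifier that is unlikely to appear in any corpus."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_algebra_item(rng):
    """Build one question with two randomly named values and return (question, answer)."""
    x, y = random_name(rng), random_name(rng)
    while y == x:  # the two names must differ
        y = random_name(rng)
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    question = (
        f"Let {x} = {a} and {y} = {b}. "
        f"What is {x} * {y} + {x}? Answer with a number only."
    )
    return question, a * b + a


def score(answers, expected):
    """Fraction of model answers that match the expected integers exactly."""
    correct = sum(1 for got, want in zip(answers, expected) if got.strip() == str(want))
    return correct / len(expected)
```

Seeding the generator (`random.Random(0)`) keeps a run reproducible while the names themselves stay meaningless to the model.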

dontupvoteme 2 years ago | |

NIAN is a very cool idea, but why not simply translate it into N different languages (you even can mix services, e.g. deepl/google translate/LLMs themselves) and ask about them that way?

internet101010 2 years ago | |

No disassemble!

bearjaws 2 years ago |

I just used it to compare two smaller legal documents and it completely hallucinated that items were present in one and not the other. It did this on three discrete sections of the agreements.

Using ctrl-f I was able to see that they were identical to one another.

Obviously this is a single sample but saying 90% seems unlikely. They were around 80k tokens total.

carlosbaraza 2 years ago | |

I have the same feeling. I asked it to find duplicates in a list of 6k items and it basically hallucinated the entire answer multiple times. Sometimes it finds some, but it interlaces the duplicates with other hallucinated items. I wasn't expecting it to get it right, because I think this task is challenging with a fixed number of attention heads. However, the answer seems much worse than Claude Opus or GPT-4.
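For reference, exact duplicates can be found deterministically outside the model, which gives a ground truth to check the LLM's answer against. A minimal sketch, assuming the items are plain strings and "duplicate" means an exact match (`find_duplicates` is just an illustrative name):

```python
from collections import Counter


def find_duplicates(items):
    """Return {item: count} for every item that occurs more than once."""
    counts = Counter(items)
    return {item: n for item, n in counts.items() if n > 1}
```

Anything the model reports that is not a key in this result was hallucinated; anything missing was overlooked.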

akomtu 2 years ago | | |

Everyone is trying to use Language Models as Reasoning Models because the latter haven't been invented yet.

fnordpiglet 2 years ago | |

That’s not needle in a haystack.

I would note that LLMs handle this task better if you slice the two documents into smaller sections and iterate section by section. They aren’t able to reason and have no memory so can’t structurally analyze two blobs of text beyond relatively small pieces. But incrementally walking through in much smaller pieces that are themselves semantically contained and related works very well.
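A sketch of the section-by-section idea, under two loud assumptions: sections are separated by blank lines, and the two documents line up section for section. `compare_fn` is a hypothetical stand-in for whatever model call you use; nothing here is a real API.

```python
from itertools import zip_longest


def split_sections(text):
    """Split on blank lines; each paragraph is treated as one section."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def compare_by_section(doc_a, doc_b, compare_fn):
    """Call compare_fn on each aligned pair of sections that is not identical.

    compare_fn is a hypothetical stand-in for a model call. Identical pairs are
    skipped, so the model is never asked about text that has no difference.
    """
    findings = []
    pairs = zip_longest(split_sections(doc_a), split_sections(doc_b), fillvalue="")
    for index, (a, b) in enumerate(pairs):
        if a == b:
            continue
        findings.append((index, compare_fn(a, b)))
    return findings
```

If sections were inserted or reordered, positional pairing breaks down and some alignment step would be needed first.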

The assumption that they are magic machines is a flawed one. They have limits and capabilities and like any tool you need to understand what works and doesn’t work and it helps to understand why. I’m not sure why the bar for what is still a generally new advance for 99.9% of developers is effectively infinitely high while every other technology before LLMs seemed to have a pretty reasonable “ok let’s figure out how to use this properly.” Maybe because they talk to us in a way that appears like it could have capabilities it doesn’t? Maybe it’s close enough sounding to a human that we fault it for not being one? The hype is both overstated and understated simultaneously but there have been similar hype cycles in my life (even things like XML were going to end world hunger at one point).

HarHarVeryFunny 2 years ago | |

That's a different test than needle-in-a-needlestack, although telling in how brittle these models are - competent in one area, and crushingly bad in others.

Needle-in-a-needlestack contrasts with needle-in-a-haystack by being about finding a piece of data among similar ones (e.g. one specific limerick among thousands of others), rather than among dissimilar ones.

1970-01-01 2 years ago | |

I've done the same experiment with local laws and caught GPT hallucinating fines and fees! The problem is real.

tmaly 2 years ago | | |

Imagine if they started using LLMs to suggest prison sentences

Aerbil313 2 years ago | |

Interesting, because the (at least official) context window of GPT-4o is 128k.

davedx 2 years ago | |

> Obviously this is a single sample but saying 90% seems unlikely.

This is such an anti-intellectual comment to make, can't you see that?

You mention "sample" so you understand what statistics is, then in the same sentence claim 90% seems unlikely with a sample size of 1.

The article has done substantial research.

dkjaudyeqooe 2 years ago | | |

The fact that it has some statistically significant performance is irrelevant and difficult to evaluate for most people.

Here's a much simpler and correct description that almost everyone can understand: it fucks up constantly.

Getting something wrong even once can make it useless for most people. No amount of pedantry will change this reality.

lopuhin 2 years ago | | |

Also, the article is testing a different task (Needle in a Needlestack, which is kind of similar to Needle in a Haystack) than finding a difference between two documents. For sure it's useful to know that the model does OK in one and really badly in the other, but that does not mean the original test is flawed.

bckr 2 years ago | |

Yeah, I asked for an estimate of the percentage of the US population that lives in the DMV area (DC, Maryland, Virginia) and it was off by 50% of the actual answer, which I only noticed once it occurred to me that I shouldn’t trust its estimate for anything important.

KeplerBoy 2 years ago | | |

Those models still can't reliably do arithmetic, so how could it possibly know that number unless it's a commonly repeated fact?

Also: would you expect random people to fare any better?

kylebenzle 2 years ago | |

What you are asking an llm to do here makes no sense.

potatoman22 2 years ago | | |

Why not? It seems like a natural language understanding task

marshray 2 years ago | | |

You haven't seen the promotion of the use of LM AI for handling legal documents?

It's purported to be a major use case.

cmrdporcupine 2 years ago | | |

You might be right but I've lost count of the number of startups I've heard of trying to do this for legal documents.

thorum 2 years ago |

The needle in the haystack test gives a very limited view of the model’s actual long context capabilities. It’s mostly used because early models were terrible at it and it’s easy to test. In fact, most recent models now do pretty well at this one task, but in practice, their ability to do anything complex drops off hugely after 32K tokens.

RULER is a much better test:

https://github.com/hsiehjackson/RULER

> Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, all models (except for Gemini-1.5-pro) exhibit large degradation on tasks in RULER as sequence length increases.

> While all models claim context size of 32k tokens or greater (except for Llama3), only half of them can effectively handle sequence length of 32K by exceeding a qualitative threshold, Llama2-7b performance at 4K (85.6%). The performance exceeding the threshold is underlined.

WhitneyLand 2 years ago | |

Maybe, but

1. The article is not about NIAH; it’s their own variation, so it could be more relevant.

2. The whole claim of the article is that GPT-4o does better, but the test you’re pointing to hasn’t benchmarked it.

sftombu 2 years ago | |

The models benchmarked by RULER do worse in needle in a needlestack. It will be interesting to see how 4o does with RULER.

19h 2 years ago |

I'd like to see this for Gemini Pro 1.5 -- I threw the entirety of Moby Dick at it last week, and at one point all the books Byung-Chul Han has ever published, and in both cases it was able to return the single part of a sentence that mentioned or answered my question verbatim, every single time, without any hallucinations.

youssefabdelm 2 years ago |

Someone needs to come up with a "synthesis from haystack" test that tests not just retrieval but depth of understanding, connections, abstractions across diverse information.

When a person reads a book, they have an "overall intuition" about it. We need some way to quantify this. Needle in haystack tests feel like a simple test that doesn't go far enough.

yatz 2 years ago |

Well, I can now use GPT to transform raw dynamic data into beautiful HTML layouts on the fly for low-traffic pages, such as change/audit logs, saving a ton of development time and keeping my HTML updated even when the data structure has changed. My last attempt did not consistently work because GPT4-Turbo sometimes ignored the context and instructions almost entirely.

ijidak 2 years ago | |

Do you have an example of this? I would love to learn more.

yatz 2 years ago | | |

Here is the entire prompt. I used rules to ensure the formatting is consistent as otherwise sometimes it might format date one way and other times in an entirely different way.

Imagine, a truly dynamic and super personal site, where layout, navigation, styling and everything else gets generated on the fly using user's usage behavior and other preferences, etc. Man!

---------------------------------------------

{JSON}

------

You are an auditing assistant. Your job is to convert the ENTIRE JSON containing "Order Change History" into a human-readable Markdown format. Make sure to follow the rules given below by letter and spirit. PLEASE CONVERT THE ENTIRE JSON, regardless of how long it is.

---------------------------------------------

RULES:
- Provide markdown for the entire JSON.
- Present changes in a table, grouped by date and time and the user, i.e., 2023/12/11 12:40 pm - User Name.
- Hide seconds from the date and time and format using the 12-hour clock.
- Do not use any currency symbols.
- Format numbers using 1000 separator.
- Do not provide any explanation, either before or after the content.
- Do not show any currency amount if it is zero.
- Do not show IDs.
- Order by date and time, from newest to oldest.
- Separate each change with a horizontal line.

balder1991 2 years ago | | |

I guess you just need to offer a template in the prompt? Then maybe some validation after.
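The "validation after" part could look something like the sketch below, assuming you only check a few of the rules from the prompt above (no currency symbols, no seconds in times, at least one Markdown table row). The regexes and the name `validate_markdown` are illustrative, not a complete checker.

```python
import re

# Each pattern covers one rule from the prompt; the set is deliberately small.
CURRENCY = re.compile(r"[$€£¥]")
TIME_WITH_SECONDS = re.compile(r"\b\d{1,2}:\d{2}:\d{2}\b")
TABLE_ROW = re.compile(r"^\|.*\|\s*$", re.MULTILINE)


def validate_markdown(output):
    """Return a list of rule violations found in the model's Markdown output."""
    problems = []
    if CURRENCY.search(output):
        problems.append("currency symbol present")
    if TIME_WITH_SECONDS.search(output):
        problems.append("time includes seconds")
    if not TABLE_ROW.search(output):
        problems.append("no Markdown table row found")
    return problems
```

An empty list means the output passed these checks; otherwise you could retry the request or fall back to a hand-written template.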

parrt 2 years ago |

The article shows how much better GPT-4o is at paying attention across its input window compared to GPT-4 Turbo and Claude-3 Sonnet.

We've needed an upgrade to needle in a haystack for a while and this "Needle In A Needlestack" is a good next step! NIAN creates a prompt that includes thousands of limericks and the prompt asks a question about one limerick at a specific location.
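To make the setup concrete, here is a minimal sketch of how such a prompt might be assembled. It is not the NIAN code; the function name, the `position` parameter (a fraction of the way through the stack) and the instruction wording are all assumptions.

```python
def build_needlestack_prompt(limericks, needle, question, position):
    """Insert `needle` among `limericks` at a relative `position` in [0, 1].

    Returns the prompt text and the index at which the needle was placed.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    index = round(position * len(limericks))
    stack = limericks[:index] + [needle] + limericks[index:]
    body = "\n\n".join(stack)
    prompt = f"{body}\n\nUsing only the limericks above, answer: {question}"
    return prompt, index
```

Sweeping `position` from 0 to 1 and repeating the question at each step is what would produce an accuracy-by-location curve.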

mianos 2 years ago | |

I agree, I paid for Claude for a while. Even though they swear the context is huge and having a huge context uses up tokens like crack, it's near useless when the source code in context is just a few pages back. It was so frustrating as everything else was as good as anything and I liked the 'vibe'.

I used 4o last night and it was still perfectly aware of a C++ class I pasted 20 questions ago. I don't care about smart, I care about useful and this really contributes to the utility.

whimsicalism 2 years ago |

Increasingly convinced that nobody on the public internet knows how to do actual LLM evaluations.

tedeh 2 years ago | |

I'm just glad that we are finally past the "Who was the 29th president of the United States" and "Draw something in the style of Van Gogh" LLM evaluation tests everyone did in 2022-2023.

petulla 2 years ago |

You need to know that this test set data wasn't included in the training data for this to be meaningful.

sftombu 2 years ago | |

If you ask the questions without providing the limerick first, it never gets the right answer. When the LLM gets the wrong answer, it is usually because it reverts to its training data and gives a generic answer that doesn't apply to the limerick.
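A sketch of that closed-book check, assuming each question comes with a list of accepted answer strings and that `ask` is a hypothetical function that sends a prompt to the model and returns its reply. The substring match is a simplification for illustration.

```python
def closed_book_rate(items, ask):
    """Fraction of questions answered correctly WITHOUT the limerick in the prompt.

    items: iterable of (question, accepted_answers) pairs.
    ask: hypothetical callable that sends a prompt to a model and returns text.
    """
    items = list(items)
    hits = 0
    for question, accepted in items:
        reply = ask(question).lower()
        if any(answer.lower() in reply for answer in accepted):
            hits += 1
    return hits / len(items) if items else 0.0
```

A rate near zero shows the model cannot answer from memory alone; it does not by itself rule out that having seen the text in training makes in-context retrieval easier.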

trifurcate 2 years ago | | |

Why are you ruling out the possibility that training on the material may confer an advantage when the data is presented, even if the advantage may not be strong enough to pass the test without the data present in the context window?

a_wild_dandan 2 years ago | |

No you don't. Compare the model's performance before and after uploading the material.

sftombu 2 years ago | | |

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419

lmeyerov 2 years ago | |

I thought the test limericks were autogenerated?

sftombu 2 years ago | | |

They come from a database of 98k limericks -- https://zenodo.org/records/5722527

personjerry 2 years ago |

That's great to hear. My biggest issue with GPT-4.0 was that as the conversation got longer, the quality diminished (especially relevant for coding projects)

I wonder if it'll be better now. Will test today.

throwthrowuknow 2 years ago | |

That’s been my experience so far. My current conversations are crazy long compared to any of my gpt4 convos which I had to frequently copy context from and start over in a new chat

sftombu 2 years ago | |

I had the same experience. With a 16k prompt, Turbo was nearly flawless. But it wasn't very good at 32k and not usable at 100+. You have to repeat information to get good results with longer prompts

itissid 2 years ago |

How do we know that gpt-4o has not been trained on this dataset?

sftombu 2 years ago | |

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419

throwthrowuknow 2 years ago |

This is a very promising development. It would be wise for everyone to go back and revise old experiments that failed now that this capability is unlocked. It should also make RAG even more powerful now that you can load a lot more information into the context and have it be useful.

demilich 2 years ago | |

Agreed

feverzsj 2 years ago |

LLMs are still toys, no one should treat them seriously. Apparently, the bubble is too massive now.

infecto 2 years ago | |

We have businesses getting real value from these toys. Maybe you have not been in the right circles to experience this?

feverzsj 2 years ago | | |

Of course you can get value from toy business, but toys are toys.

nopromisessir 2 years ago | |

Used toys to write a working machine vision project over last 2 days.

Key word: working

The bubble is real on both sides. Models have limitations... However, they are not toys. They are powerful tools. I used 3 different SotA models for that project. The time saved is hard to even measure. It's big.

SiempreViernes 2 years ago | | |

> The time saved is hard to even measure. It's big.

You are aware that this is an obvious contradiction, right? Big time savings are not hard to measure.

cdelsolar 2 years ago | |

Must be a pretty cool toy; it constantly 10X’s my productivity.

nopromisessir 2 years ago | | |

You said it mate. I feel bad for folks who turn away from this technology. If they persist... They will be so confused why they get repeatedly lapped.

I wrote a working machine vision project in 2 days with these toys. Key word: working... Not hallucinated. Actually working. Very useful.

davedx 2 years ago | | |

It's staggering to me that people on Hacker News are actually downvoting people saying how AI is boosting productivity or levering business or engineering or finance. The denial, cynicism and sheer wilful ignorance is actually depressing. I get that not everyone is working directly with AI/ML but I honestly expected better on a website about technology.

People are deliberately self selecting themselves out of the next industrial revolution. It's Darwin Awards for SWE careers. It's making me ranty.

sschueller 2 years ago |

We are all so majorly f*d.

The general public does not know nor understand this limitation. At the same time OpenAI is selling this as a tutor for your kids. Next it will be used to test those same kids.

Who is going to prevent this from being used to pick military targets (EU law has an exemption for military of course) or make surgery decisions?

causality0 2 years ago |

I don't understand OpenAI's pricing strategy. For free I can talk to GPT 3.5 on an unlimited basis, and a little to GPT 4o. If I pay $20 a month, I can talk to GPT 4o eighty times every three hours, or once every two and a quarter minutes. That's both way more than I need, and way less than I would expect for twenty dollars a month. I wish they had a $5 per month tier that included, say, eighty messages per 24 hours.

hackerlight 2 years ago | |

It'll make more sense when they deploy audio and image capability to paying users only, which they say they're going to do in a few weeks

causality0 2 years ago | | |

Yeah, but I want a tier where I have access to it in a pinch, but won't feel guilty for spending the money and then going a whole month without using it.

whereismyacc 2 years ago |

I always thought it seemed likely that most needle in a haystack tests might run into the issue of the model just encoding some idea of 'out of place-ness' or 'significance' and querying on that, rather than actually saying something meaningful about generalized retrieval capabilities. Does that seem right? Is that the motivation for this test?

tartrate 2 years ago |

Are there any prompts/tests about recalling multiple needles (spread out) at once?

For example, each needle could be a piece to a logic puzzle.

ammar_x 2 years ago |

The article compares GPT-4o to Sonnet from Anthropic. I'm wondering how Opus would perform at this test?

throw7381 2 years ago |

Has anyone done any benchmarks for RAG yet?

ionwake 2 years ago |

I am in England; do US users have access to memory features? (Also, do you have access to voice customisation yet?)

Thanks

rob137 2 years ago | |

I am in England, on the 'Team Plan'* and got access to memory this week.

* https://openai.com/index/introducing-chatgpt-team/

ionwake 2 years ago | | |

Thank you!

sumedh 2 years ago | |

Memory features are available in Australia.

nickca 2 years ago |

Would love to see Gemini there too!

cararemixed 2 years ago |

What's the chance that these limericks are now in the training set? As others mention, it'd be interesting to come up with a way to synthesize something sufficiently interesting so it always evades training fit.

sftombu 2 years ago | |

Previous answer to this question:

https://news.ycombinator.com/item?id=40361419

causal 2 years ago | | |

Your test is a good one but the point still stands that a novel dataset is the next step to being sure.

asadm 2 years ago |

I have had good experience with the Gemini 1M context model on these kinds of tasks.

croes 2 years ago |

>Needle in a Needlestack is a new benchmark to measure how well LLMs pay attention to the information in their context window

I asked GPT-4o for JavaScript code and got Python, so much for attention.

kolinko 2 years ago | |

What was your query?

rguptill 2 years ago |

We also need a way to determine where a given response fits in the universe of responses - is it an “average” answer or a really good one

edmara 2 years ago | |

If you have an evaluation function which does this accurately and generalizes, you pretty much already have AGI.

m3kw9 2 years ago |

One could have the LLM route it to a text search function and have the function report back to the LLM for secondary processing.
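The search side of that could be as simple as the sketch below: a case-insensitive substring search that returns line numbers and a little surrounding context for the LLM to work from. `find_passages` and its parameters are made-up names, and how the LLM decides to call it is left out.

```python
def find_passages(document, query, context_lines=1):
    """Return (line_number, snippet) for each line containing `query`, ignoring case.

    Line numbers are 1-based; the snippet includes `context_lines` lines on each side.
    """
    lines = document.splitlines()
    needle = query.lower()
    results = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - context_lines)
            end = min(len(lines), i + context_lines + 1)
            results.append((i + 1, "\n".join(lines[start:end])))
    return results
```

The snippets, rather than the whole document, would then go back into the prompt for the secondary processing step.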

dmose2 2 years ago |

It's interesting (though perhaps not surprising) to see the variance in curve shape across models.

m3kw9 2 years ago |

I thought Google Gemini had almost perfect needle in haystack performance inside 1 million tokens?

sftombu 2 years ago | |

The reason I made Needle in a needlestack is that LLMs are getting too good at needle in a haystack. Until GPT-4o, no model was good at the NIAN benchmark.

DeathArrow 2 years ago |

I wonder how llama3 is doing.

pojzon 2 years ago |

Meh still for a lot of stuff it simply lies.

Just today it lied to me about VRL language syntax, trying to sell me some Python stuff in there.

Senior ppl will often be able to call out the bullshit, but I believe for junior ppl it will be very detrimental.

Nevertheless, an amazing tool for d2d work if you can call out BS replies.

8thcross 2 years ago |

These benchmarks are becoming like the top 10 lists you find on the internet. I agree that everything has a space, but frankly how many of us need a test that tells you that this is great at limericks?

EGreg 2 years ago |

I think large language models can be used to classify people as lying, saying rehearsed things, or being disingenuous. Simply train them on a lot of audio of people talking, and they would become better than most polygraph machines. There’s something about how a person says something that quickly reveals that it was rehearsed earlier, or premeditated, and I’m sure when they’re lying there can be things like that too. The LLM can instantly pick up on it with some probability and classify it.

I’ve seen claims during the OpenAI demo that their software can now pick up on extremely subtle emotional cues in how people speak. Then it shouldn’t take much more to make it read between the lines and understand what people are intending to say, for example by enumerating all possible interpretations and scoring them based on many factors, including the current time, location, etc. In fact, by taking into account so much context and so many factors, the LLMs will be better than people the vast majority of the time at understanding what a person meant, assuming they were genuinely trying to communicate something.

It will become very hard to lie because everyone’s personal LLM will pick up on it fairly quickly and find tons of inconsistencies, which it will flag for you later. You will no longer be fooled so easily, and if it has the context of everything the person has said publicly, plus if the person gives permission for your LLM to scan everything they’ve said privately because you’re their business partner or sexual partner, it can easily catch them in many lies and so on.

I predict that in the next 5 to 10 years, human society will completely change as people start to prefer machines to other people, because the machines understand them so well and take into account the context of everything they’ve ever said. They will be thoughtful, remembering details about the person in many different dimensions, and use them to personalize everything. By contrast, the most thoughtful husband or boyfriend will seem like a jerk seems now. Or a cat.

Humor and seductive conversation will also be at a superhuman standard. People will obviously up their game too, just like they did when playing the game Go after Lee Sedol was totally destroyed by AlphaGo, or when people started using AlphaZero to train for chess. However, once the computers understand what triggers people to laugh or have a sexual response, they will be able to trigger them a lot more predictably; they simply need more training data.

And bullshitting will be done on a completely different level. Just like people no longer walk to destinations but use cars to go thousands of miles a year, similarly people won’t interact with other people so much anymore. The LLMs, trained to bullshit 1000 times better than any human, will be undetectable and gradually shift public opinion as open source models power swarms of accounts.

Analyzing surgical field... Identified: open chest cavity, exposed internal organs Organs appear gooey, gelatinous, translucent pink Comparing to database of aquatic lifeforms... 93% visual match found: Psychrolutes marcidus, common name "blobfish" Conclusion: Blobfish discovered inhabiting patient's thoracic cavity Recommended action: Attempt to safely extract blobfish without damaging organs