A sleep-like consolidation mechanism for LLMs

A sleep-like consolidation mechanism for LLMs (arxiv.org)

212 points by juxtapose 14 days ago | 140 comments

The idea of periodically stopping to write blocks of recent context into a fast-weight state is interesting, but I think I liked it better when E2E-TTT [1] did it. It's a more flexible and elegant continuous learning approach.

Essentially it goes "You know how your model can remember its training data? Well, what if you treated its recent context like more training data and updated (some of) the weights using (mostly) the same process used to train it?"

The end result is very good at remembering things but also really good at adapting to new unseen distributions.

[1] https://arxiv.org/abs/2512.23675
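
A toy sketch of the "treat recent context as training data" idea may help here. This is not the E2E-TTT implementation: it is a two-weight linear predictor, and every name in it (`slow_w`, `fast_w`, `ttt_update`, `lr`, `steps`) is made up for illustration. The frozen "slow" weights stand in for pretraining, and only the per-conversation "fast" weights take gradient steps on the context.

```python
def predict(slow_w, fast_w, x):
    # Effective weights are the frozen slow weights plus per-conversation fast deltas.
    return sum((s + f) * xi for s, f, xi in zip(slow_w, fast_w, x))

def ttt_update(slow_w, fast_w, context, lr=0.1, steps=50):
    """Treat (x, y) pairs from recent context as training data; update only fast_w."""
    fast_w = list(fast_w)
    for _ in range(steps):
        for x, y in context:
            err = predict(slow_w, fast_w, x) - y
            # Gradient of 0.5 * err**2 with respect to fast_w is err * x.
            fast_w = [f - lr * err * xi for f, xi in zip(fast_w, x)]
    return fast_w

slow_w = [1.0, 0.0]   # what "pretraining" learned: y = x0
fast_w = [0.0, 0.0]
# Recent context drawn from a shifted distribution: y = x0 + 2 * x1
context = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]

before = predict(slow_w, fast_w, [1.0, 1.0])
fast_w = ttt_update(slow_w, fast_w, context)
after = predict(slow_w, fast_w, [1.0, 1.0])
print(before, round(after, 3))
```

The slow weights never change, so the adaptation can be thrown away or swapped per conversation by replacing `fast_w`.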

samsartor 14 days ago | |

Yeah, I think E2E-TTT is a lot more like what people in this comments section are picturing. As far as I can tell, this method doesn't update model weights at all during the "sleep" period, only the usual SSM state that any Mamba model updates after each token. They just optimized the model to use that SSM state _more_ when an eviction is about to happen.

soulofmischief 14 days ago | |

Each model needs to be a separate copy, or at least have those particular weights be interchangeable, for every single user.

Remember Microsoft Tay.

https://en.wikipedia.org/wiki/Tay_(chatbot)#Initial_release

thunderbird120 14 days ago | | |

Yes, since the weights being updated are a small subset of the overall total it's manageable. Just like how each separate conversation currently requires you to store a separate KV cache, you'd need to store the fast weights separately. Both KV cache and fast weight content stores have to be conversation specific, so just setting a bit of extra RAM aside for "memory" isn't really a new ask, just a different format for an old problem.
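
A back-of-the-envelope comparison of the two per-conversation stores, as a rough sketch. The KV cache formula is the standard one (keys and values, per layer, per KV head, per token). The model dimensions and the fast-weight layout (one square matrix per adapted layer) are hypothetical numbers chosen for illustration, not taken from either paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    # Keys and values (factor of 2) are stored per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

def fast_weight_bytes(fast_layers, d_in, d_out, bytes_per_elem=2):
    # Assumption: one d_in x d_out fast-weight matrix per adapted layer.
    return fast_layers * d_in * d_out * bytes_per_elem

# Hypothetical model shape, 16-bit elements.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=128_000)
fw = fast_weight_bytes(fast_layers=8, d_in=4096, d_out=4096)

print(f"KV cache at 128k tokens: {kv / 2**30:.2f} GiB")
print(f"Fast weights (any context length): {fw / 2**30:.2f} GiB")
```

The KV cache grows linearly with the number of tokens, while the fast-weight store has a fixed size regardless of how long the conversation runs.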

pfannkuchen 14 days ago | |

I wonder if we can get children to make something their life’s dream if we make the cool books about it when they are growing up? I wonder how flexible the human mind can be in convincing itself that it is fulfilling its dream?

knollimar 14 days ago | | |

This sounds like a horror novel

bmc7505 14 days ago |

This topic recently came up at the FLANN workshop [1], and seems to periodically be rediscovered [2,3,4] in different contexts. While some have speculated about the biological role it plays (e.g., Pearlmutter & Houghton [5]), we still lack a conclusive theory of sleep, but the convergent evolution of this specific phenomenon across the animal kingdom and the fact that deprivation is inevitably fatal seem like important clues.

[1]: https://flann.cs.yale.edu

[2]: https://www.cs.toronto.edu/~hinton/csc2535/readings/ws.pdf

[3]: https://arxiv.org/abs/1711.02282

[4]: https://arxiv.org/abs/2006.08381

[5]: https://mural.maynoothuniversity.ie/id/eprint/1653/1/Hamilto...

swyx 14 days ago |

related preprint from the letta team https://arxiv.org/abs/2504.13171

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.

micromacrofoot 14 days ago |

To reach a more brain-like behavior LLMs need to integrate your inputs into their model dynamically, essentially retraining real-time based on the most salient input. Human brains do this selectively all the time and it's part of our plasticity.

Biologically humans do similar compression, so introducing a similar concept to an LLM also feels reasonable. Hardware isn't fast/cheap enough to do this on an ongoing basis, similar to how it's too expensive for our brains to do this while we're moving through the world.

Most of the time, all we have now in LLMs is "working memory"; we're missing a lot of the functionality that allows for episodic memory and selective plasticity.

The more you read about how human brains work, the more you realize that we may have figured out a piece with LLMs, but it's certainly nothing approaching AGI. People insisting so are blowing smoke for investor hype or don't understand a big piece of the concepts involved.

logicchains 14 days ago | |

>To reach a more brain-like behavior LLMs need to integrate your inputs into their model dynamically, essentially retraining real-time based on the most salient input.

That's already possible with LLMs. The challenge is that 1. it would allow permanently jail-breaking models and 2. there'd be no way for them to efficiently transfer what they'd learned to a new model generation.

micromacrofoot 14 days ago | | |

Oh do you have a source? I haven't seen it done in real-time.

Coincidentally, the human brain is also jailbroken and nontransferable.

elphard 14 days ago |

We should let them sleep with half a brain at a time like migrating birds.

jonnyasmar 14 days ago |

What happened to Claude's auto-dream? I thought it was brilliant.

rahen 14 days ago |

That's an idea I had a few months ago: after going through a compaction once the KV cache is nearing capacity, accumulate this knowledge into a dataset to fine-tune a LoRA during offline hours.

This would create a three-layer memory system:

- Stable long-term memory (initial base weights)

- Mid-term memory built from the compactions and replay buffers

- Short-term memory (KV cache)

Sleeping would just be a fancy term for consolidating and transferring information from one memory layer to another during offline hours. Maybe that's also what the brain does while sleeping.
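
The three layers above can be sketched as a small data structure. This is only an illustration of the bookkeeping, under heavy assumptions: the "summary" is a placeholder that keeps every other token, the LoRA fine-tune is replaced by a list called `adapter_data`, and the class and method names are invented here.

```python
from dataclasses import dataclass, field

@dataclass
class ThreeTierMemory:
    kv_capacity: int                                   # short-term limit, in tokens
    kv_cache: list = field(default_factory=list)       # short-term: live context
    replay_buffer: list = field(default_factory=list)  # compactions awaiting "sleep"
    adapter_data: list = field(default_factory=list)   # stand-in for mid-term LoRA weights
    # Long-term memory (the base weights) is frozen, so it is not modeled here.

    def observe(self, token):
        self.kv_cache.append(token)
        if len(self.kv_cache) >= self.kv_capacity:
            self.compact()

    def compact(self):
        # Placeholder summary: keep every other token. A real system would summarize.
        self.replay_buffer.append(self.kv_cache[::2])
        self.kv_cache.clear()

    def sleep(self):
        # Stand-in for offline LoRA fine-tuning on the accumulated compactions.
        self.adapter_data.extend(self.replay_buffer)
        self.replay_buffer.clear()

mem = ThreeTierMemory(kv_capacity=4)
for tok in "a b c d e f g h i".split():
    mem.observe(tok)
mem.sleep()
print(mem.kv_cache, mem.adapter_data)
```

Information only ever moves downward: from the KV cache into the replay buffer at compaction time, and from the replay buffer into the adapter during `sleep()`.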

chermi 14 days ago | |

Wouldn't that just accelerate collapse? How much do you trust the outputs of the LLM to provide trustworthy and valuable new information? I mean, I understand distillation works. But that's much more structured and thoughtful than my sessions, at least.

jack_pp 14 days ago | | |

We can trust the feedback we give it based on the output it provides.

rahen 14 days ago | | |

I was thinking of curated replay buffers, which would act like "dreams". To prevent collapse, the offline dataset would mix the new mid-term data with a baseline of anchor data (the original training distribution) so the model doesn't drift.

Also, we wouldn't train on the whole session. A separate critic module, like a reward model, would filter the KV cache to extract the high-value information, like a garbage collector before the LoRA.

That's just an idea though. Right now most research focuses on changing the architecture itself (TITAN, HOPE...) instead.
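
A minimal sketch of how such a "dream" dataset could be assembled, assuming the session has already been turned into text items (filtering the raw KV cache, as described above, is simplified away). The critic here is a toy keyword check standing in for a reward model, and `build_sleep_dataset`, `threshold`, and `anchor_ratio` are hypothetical names.

```python
import random

def build_sleep_dataset(session_items, anchor_pool, critic,
                        threshold=0.5, anchor_ratio=1.0, seed=0):
    """Keep session items the critic scores highly, then mix in anchor examples."""
    rng = random.Random(seed)
    kept = [item for item in session_items if critic(item) >= threshold]
    # Anchor data from the original distribution is mixed in to limit drift.
    n_anchor = min(len(anchor_pool), round(len(kept) * anchor_ratio))
    anchors = rng.sample(anchor_pool, n_anchor)
    dataset = [("session", x) for x in kept] + [("anchor", x) for x in anchors]
    rng.shuffle(dataset)
    return dataset

def toy_critic(item):
    # Stand-in for a reward model: only items marked as facts are "high value".
    return 1.0 if item.startswith("fact:") else 0.0

session = ["fact: the deploy key rotates monthly", "lol ok",
           "fact: prod runs Postgres", "hmm"]
anchors = [f"pretraining sample {i}" for i in range(10)]

dataset = build_sleep_dataset(session, anchors, toy_critic)
for source, text in dataset:
    print(source, "|", text)
```

With `anchor_ratio=1.0` the resulting dataset is half filtered session data and half anchor data; raising the ratio trades adaptation speed for stability.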

DonHopkins 14 days ago | |

It's a network of computers with GPUs, so there's no reason it can't sleep at the same time it's awake. Just a continuous "sleeping" process going on in the background, incrementally updating the model. No need for the "thinking" process to be "unconscious" while the "sleeping" process runs. Anthropomorphism confuses everything. There's no such thing as "offline hours" because the Earth is a sphere and the United States is not the center of the universe.

fc417fc802 14 days ago | | |

> the Earth is a sphere and the United States is not the center of the universe.

Felt like stating the obvious there? Greenwich being the center of everything after all.

jgreid 14 days ago |

Isn't this simply context pruning/optimization?

kylemaxwell 14 days ago | |

From the abstract, it looks like it's actually doing something deeper, updating weights in part of the model?

samsartor 14 days ago | | |

The abstract and method sections only mention updating the SSM state during "sleep" (i.e., the same vectors that change after each token in stock Mamba), not any of the actual weight matrices. AFAICT this is just another attention compaction paper, with a misleading title? It is not very clearly written.
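
For readers unfamiliar with the distinction, here is a deliberately tiny sketch of state versus weights in a linear recurrence. It is a scalar toy, not Mamba: real Mamba makes the step parameters input-dependent, but the projection matrices that produce them are still fixed at inference. The names `a`, `b`, and `ssm_step` are illustrative.

```python
def ssm_step(state, x, a, b):
    # One step of a scalar linear recurrence: h_t = a * h_{t-1} + b * x_t.
    return a * state + b * x

a, b = 0.9, 0.5   # parameters: learned during training, fixed at inference
state = 0.0       # recurrent state: rewritten on every token

states = []
for x in [1.0, 0.0, 2.0]:
    state = ssm_step(state, x, a, b)
    states.append(state)

print(states, (a, b))
```

Only `state` changes as tokens arrive; `a` and `b` are the same before and after, which is the sense in which nothing is being "trained".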

colechristensen 14 days ago | |

No, they're actually training weights based on context before compaction. Context is context; this is splitting the model into persistent weights and malleable ones which are periodically updated.

delis-thumbs-7e 14 days ago | | |

Wouldn’t that be extremely computationally expensive, considering how resource-intensive training is?

energy123 14 days ago |

Would be a big deal if you don't have to care about quadratic attention cost. Some workflows become a lot cheaper.
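
A rough count makes the saving concrete. The sketch below counts query-key pairs for causal attention over a full context versus a cache that is emptied every `window` tokens. The window size and token count are arbitrary, and the cost of the consolidation step itself is ignored, so this is an upper bound on the benefit rather than a measurement.

```python
def attention_pairs_full(n_tokens):
    # Token t attends to t keys (itself and all predecessors): n(n+1)/2 in total.
    return n_tokens * (n_tokens + 1) // 2

def attention_pairs_with_eviction(n_tokens, window):
    # Assumption: the KV cache is emptied every `window` tokens, so each
    # segment pays the quadratic cost only within itself.
    full_segments, remainder = divmod(n_tokens, window)
    return (full_segments * attention_pairs_full(window)
            + attention_pairs_full(remainder))

full = attention_pairs_full(1_000_000)
evicted = attention_pairs_with_eviction(1_000_000, window=8_192)
print(f"full: {full:,}  with eviction: {evicted:,}  ratio: {full / evicted:.0f}x")
```

With a fixed window the total grows linearly in the number of tokens instead of quadratically.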

hmokiguess 14 days ago |

This could be a solution in search of a problem, I would be careful with overfitting.

mos765817 13 days ago |

wasn't this what Google did long ago? https://openreview.net/forum?id=iiZy6xyVVE

scotty79 14 days ago |

Context -> LoRA would be soooo cool.

gt0 14 days ago |

This seems as much like "sleep" as when a laptop "sleeps".

m3kw9 14 days ago |

sleep aka processing the data differently.

m0unta1ntube 14 days ago |

why not just design the LLM like an OS?

IAmGraydon 14 days ago |

The entire industry is so desperate to anthropomorphize. What the paper describes is an offline recurrent consolidation phase: the model runs multiple forward passes over recently accumulated context, updates persistent fast weights in SSM blocks, then clears the KV cache before continuing. It has absolutely nothing to do with sleeping, but I believe the authors had a goal in mind when creating this title, and it was for journalists to pick it up and run with it, further inflating the AI-is-just-like-us hype bubble.
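
The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's method: the persistent memory is a single scalar recurrent state (whether the paper updates "fast weights" or only SSM state is disputed elsewhere in this thread), and `consolidate`, `passes`, `decay`, and `gain` are invented names and values.

```python
def consolidate(state, kv_cache, passes=2, decay=0.9, gain=0.1):
    """Replay cached tokens into the recurrent state, then evict the cache."""
    for _ in range(passes):            # multiple passes over recent context
        for x in kv_cache:
            state = decay * state + gain * x
    kv_cache.clear()                   # eviction: attention starts fresh
    return state

def run(tokens, kv_capacity):
    state, kv_cache, sleeps = 0.0, [], 0
    for x in tokens:
        kv_cache.append(x)
        if len(kv_cache) == kv_capacity:
            state = consolidate(state, kv_cache)
            sleeps += 1
    return state, kv_cache, sleeps

state, kv_cache, sleeps = run([1.0] * 10, kv_capacity=4)
print(round(state, 3), kv_cache, sleeps)
```

After each consolidation, whatever the model retains about the evicted tokens has to live in `state`, since the cache no longer holds them.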

genxy 14 days ago | |

It is a descriptive analogy, get over yourself.

IAmGraydon 14 days ago | | |

An intelligent reply from an obviously intelligent guy!

A more appropriate title would have been something like "Offline Recurrent Memory Consolidation for Long-Context Language Models". This is supposed to be a research paper, not a story book. The title should give context to other researchers, and not be clearly engineered for clicks. If you don't think so, that's your prerogative, but you're objectively wrong.

semiinfinitely 14 days ago |

academic clickbait

pcrh 14 days ago |

I can't pretend to understand how LLMs work, but I can be sure that anthropomorphizing their functions is not helpful to an objective debate over their abilities.

Does a motor vehicle get "sleep" when it is serviced? When I reboot a computer, is that equivalent to a nap?

hansmayer 14 days ago |

Sweet Jesus, so not only are they performing qualitatively worse than humans, too expensive for any serious work, but now they also "need" to sleep? What's next - unionisation so they can enjoy 8 hours of culture too?

victorkulla 14 days ago |

No, they do not. I'm sure that if you presented the same argument about, I don't know, your car's CPU with built-in AI, this would be a different discussion entirely.

danielrmay 14 days ago |

The "sleep" thing gives me the creeps so in my head I'm just going to think of it as the difference between "response time retrieval" and "background consolidation".

I do think it points at something bigger than just attention architecture: "memory" isn't just storage, and merely longer context isn't the same thing as having a better understanding of the source data.

I'm looking at this through the "personal AI" lens, where I think the missing "memory" layer is consolidation & prioritization. It's not enough to just pattern match, grab the right emails, notes, etc., stuff them into the context window & hope. Instead, it's useful to do offline processing that turns events into durable state: clusters of observed data become episodes, assumptions, and contradictions, and power the confidence behind suggestions.

That also pushes up the need for provenance & inspectability. It's going to be interesting to see what kind of memory consolidation strategies are required for each domain use case.

sonink 14 days ago | |

I think you are missing the most important part: forgetting. The missing "memory" layer is consolidation, prioritization AND forgetting (of what is not important).

Also not too sure about provenance and inspectability; it is part of memory. If the source is deemed 'important', it will survive forgetting. If not, then maybe not. And it's OK. I am sure you don't know the exact source who told you that the capital of France is Paris. You forgot, and it's no big deal.