You can avoid this with double buffering. At ~70% capacity, summarize the conversation so far into a checkpoint and start a back buffer seeded with that summary. Keep working in the active context, appending each new message to both buffers. When the active context hits the wall, swap. The new context holds the compressed old history plus full-fidelity recent messages.
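The steps above can be sketched as a small Python class. This is a minimal sketch under stated assumptions, not a reference implementation. `DoubleBufferedContext`, `summarize`, and `count_tokens` are hypothetical names. Token counts are approximated by whitespace-separated words. The summary is assumed to be much shorter than the 70% threshold. The checkpoint summary runs on a background thread so the conversation can keep going in the active buffer, and the swap waits for it only if it hasn't finished yet.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional


class DoubleBufferedContext:
    """Sketch of a double-buffered LLM context (hypothetical API, not a real library).

    `summarize` stands in for an LLM summarization call; `count_tokens` defaults
    to a crude word count instead of a real tokenizer.
    """

    def __init__(self, capacity: int, summarize: Callable[[List[str]], str],
                 count_tokens: Callable[[str], int] = lambda s: len(s.split()),
                 threshold: float = 0.7) -> None:
        self.capacity = capacity                  # hard context limit, in "tokens"
        self.threshold = threshold                # fraction of capacity that triggers a checkpoint
        self.summarize = summarize
        self.count_tokens = count_tokens
        self.active: List[str] = []               # what the model currently sees
        self.back: Optional[List[str]] = None     # messages appended since the checkpoint
        self.checkpoint: Optional[Future] = None  # pending summary of the older history
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _size(self, msgs: List[str]) -> int:
        return sum(self.count_tokens(m) for m in msgs)

    def append(self, msg: str) -> None:
        # Hit the wall: swap in the back buffer before adding the new message.
        if self._size(self.active) + self.count_tokens(msg) > self.capacity:
            self._swap()
        self.active.append(msg)
        if self.back is not None:
            self.back.append(msg)  # new messages go to both buffers
        # ~70% full and no checkpoint yet: summarize a snapshot in the background.
        if self.checkpoint is None and self._size(self.active) >= self.threshold * self.capacity:
            self.checkpoint = self._pool.submit(self.summarize, list(self.active))
            self.back = []

    def _swap(self) -> None:
        if self.checkpoint is None:
            # No checkpoint was taken (one large message jumped past the threshold),
            # so summarize at the wall, like ordinary stop-the-world compaction.
            summary, recent = self.summarize(self.active), []
        else:
            summary = self.checkpoint.result()  # blocks only if the summary isn't done
            recent = list(self.back or [])
        # Back buffer = compressed old history + full-fidelity recent messages.
        self.active = [summary] + recent
        self.back, self.checkpoint = None, None


if __name__ == "__main__":
    ctx = DoubleBufferedContext(capacity=10, summarize=lambda msgs: f"sum{len(msgs)}")
    for i in range(11):
        ctx.append(f"m{i}")
    print(ctx.active)  # ['sum7', 'm7', 'm8', 'm9', 'm10']
```

Running the checkpoint on a thread is one reading of "keep working". A synchronous call at the threshold would also work, just with a pause at 70% instead of at the wall. If no checkpoint exists when the buffer overflows, `_swap` falls back to summarizing on the spot.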
The cost is the same single summarization call you'd make anyway, just made earlier, while the model isn't yet at the attention cliff. It's a 40-year-old technique (graphics, databases, stream processing). Nobody had applied it to LLM context. In the worst case, it degrades to exactly today's status quo: summarizing at the wall.