Attention Sinks in LLMs for endless fluency (huggingface.co)
It can be applied to pretrained LLMs with little to no additional effort, and Hugging Face transformers is working on first-party support. Until then, the third-party module in the blog post already works well.
The key point is that these tokens are just used to "offload" attention scores - their semantic meaning is irrelevant.
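To make the "offload" idea concrete, here is a rough sketch in plain Python. It is not the code from the blog post or from any library. The function names (`softmax`, `positions_to_keep`) and the defaults (`num_sink=4`, `window=8`) are made up for illustration. It assumes the usual setup where softmax forces attention weights to sum to 1, and a cache policy that always retains the first few positions plus a sliding window of recent ones.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax; the weights always sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positions_to_keep(seq_len: int, num_sink: int = 4, window: int = 8) -> list[int]:
    """Cache positions retained under a sink + sliding-window policy.

    Hypothetical helper: keeps the first `num_sink` positions (the sinks)
    and the most recent `window` positions, dropping everything in between.
    """
    if seq_len <= num_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# The weights must sum to 1, so the mass has to land somewhere.
# Here position 0 has the largest score and absorbs most of it.
weights = softmax([2.0, 0.0, 0.0])

# After 20 tokens, positions 4..11 are evicted; the sinks stay.
kept = positions_to_keep(20, num_sink=4, window=8)
```

A plain sliding window would eventually evict position 0, and the weight it used to absorb would be spread over the remaining tokens. Keeping the sink positions in the cache avoids that shift, whatever those tokens happen to say.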