You should put effort into getting them in a good place and accept the token levels (which are part of that design space).
This plugin slims descriptions to one-liners like "Read file content." while cutting 21-45% of token usage. No schema changes, no custom tools. Just trimmed boilerplate as an opt-in plugin.
Only if you are doing it wrong, search >>> summarization
Then the other question: is it deterministic between runs, or am I going to get a different summary each session, turn, or tool call? And depending on that frequency, am I using more tokens than I save by doing summarization for N tools?
Minimizing token usage is not the goal in and of itself, re: the ageless tradeoff of quantity vs. quality.
For some context, my system prompt is around 5k tokens at the start. I put file contents there (read/write/agents.md), which saves millions of tokens and seems to work better than making them message parts.
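To make the break-even question concrete, here is a rough sketch of the arithmetic. Every number and name below is a made-up assumption for illustration, not a measurement from any plugin or benchmark: the savings scale with how many requests carry the tool descriptions, while the cost scales with how often the summaries get regenerated.

```python
def net_tokens_saved(original_tokens, slim_tokens, summarization_cost,
                     regenerations, requests):
    """Net tokens saved by slimming tool descriptions.

    original_tokens / slim_tokens: description tokens for all N tools, per request.
    summarization_cost: tokens spent producing the summaries once.
    regenerations: how many times the summaries are (re)generated.
    requests: number of model requests that carry the tool descriptions.
    """
    saved = (original_tokens - slim_tokens) * requests
    spent = summarization_cost * regenerations
    return saved - spent


# Hypothetical figures: 4000 -> 2500 description tokens, 6000 tokens per summarization pass.
summarized_once = net_tokens_saved(4000, 2500, 6000, regenerations=1, requests=100)
summarized_every_request = net_tokens_saved(4000, 2500, 6000, regenerations=100, requests=100)

print(summarized_once)           # 144000
print(summarized_every_request)  # -450000
```

Under these assumed numbers, summarizing once pays for itself quickly, while re-summarizing on every request costs more than it saves.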
> Just trimmed boilerplate
This is not what I see this tool doing. It's automatically manipulating words in the background that you should put far more care and attention towards. Referring to those words as "boilerplate" you can just throw into a slop machine is revealing.
The descriptions aren't dynamically summarized either. They're static in the plugin, same every call, every session. Zero overhead, fully deterministic.
This has been validated in over 3000 benchmark runs in OpenCode and I ran the entire Exercism Python practice suite (https://github.com/exercism/python/tree/main/exercises/pract...) with and without the plugin with identical results. An initial dataset is shared in the repo.
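For readers wondering what "static in the plugin, same every call" could look like, here is a minimal sketch. It is not the plugin's actual code: the tool-definition shape, the `SLIM_DESCRIPTIONS` table, and `slim_tools` are hypothetical, and only the "Read file content." string comes from the thread. The point is that a fixed lookup table involves no model call, so the output is the same on every invocation and the schema is left untouched.

```python
# Hypothetical static table of one-line descriptions, keyed by tool name.
SLIM_DESCRIPTIONS = {
    "read": "Read file content.",
    "write": "Write file content.",  # assumed wording, for illustration only
}


def slim_tools(tools):
    """Return copies of the tool definitions with descriptions replaced
    from the static table. Tools without an entry keep their description;
    all other fields (e.g. parameters) are left as they are."""
    return [
        {**tool, "description": SLIM_DESCRIPTIONS.get(tool["name"], tool["description"])}
        for tool in tools
    ]


tools = [
    {
        "name": "read",
        "description": "Reads a file from the local filesystem, with many usage notes.",
        "parameters": {"path": "string"},
    },
    {
        "name": "grep",
        "description": "Search file contents.",
        "parameters": {"pattern": "string"},
    },
]

first = slim_tools(tools)
second = slim_tools(tools)
print(first == second)           # True: same input, same output, no model involved
print(first[0]["description"])   # Read file content.
```

This only shows that the rewrite step is deterministic; whether the model behaves the same with the shorter descriptions is a separate question.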
> with identical results
If your results are identical, you should be very sus; something is wrong if this is true. Nothing in agentic is reliable or fully deterministic.
The full benchmarking methodology and tooling will be published alongside the paper.
words matter
which is why I still think this is a terrible idea. I don't think it holds up in the general case, and as a peer reviewer I would be inclined to believe there is benchmark filtering that makes for good results.
You should use the same benchmarks everyone else is using when you write your paper.