You can learn about them by evaluating that site: https://github.com/Alexhans/eval-ception. After that, the pattern should be easy to test on your own project.
It's agents all the way down!
Submit a GitHub repo containing skills to Tessl, and it will generate the evals, run them, and present the results. https://tessl.io/registry/skills/submit
The evals and results are all shown, no login necessary, so you can assess them yourself, e.g. https://tessl.io/registry/skills/github/coreyhaines31/market... (click details to see the eval texts).
But since it's not, what I do to avoid working on AGENTS.md blind is test it against whatever prompted me to write it.
I have some prompt, and the AI messes it up in a way I think it shouldn't. Maybe it's something I've seen it do before and I'm sick of it. So I update AGENTS.md, revert the changes, /undo in the chat context, and re-submit the same prompt.
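If you wanted to turn that manual loop into something repeatable, a rough sketch might look like the following. Everything here is hypothetical: `run_agent` stands in for however you invoke your own tool, `fake_agent` is a stub so the example runs on its own, and the "revert" step is implicit because the stub keeps no state between runs. Treat it as an illustration of the before/after comparison, not as how any particular agent works.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RegressionCase:
    """One prompt plus a check for the behaviour you are tired of seeing."""
    prompt: str
    is_bad: Callable[[str], bool]  # True when the output shows the unwanted behaviour


def compare(
    case: RegressionCase,
    run_agent: Callable[[str, str], str],  # (instructions, prompt) -> output
    old_instructions: str,
    new_instructions: str,
    trials: int = 3,
) -> Dict[str, int]:
    """Re-submit the same prompt under both instruction sets and count bad outputs."""
    results = {}
    for label, instructions in (("before", old_instructions), ("after", new_instructions)):
        results[label] = sum(
            case.is_bad(run_agent(instructions, case.prompt)) for _ in range(trials)
        )
    return results


def fake_agent(instructions: str, prompt: str) -> str:
    """Stub standing in for a real agent call (an assumption for this sketch)."""
    if "use logging" in instructions:
        return 'logging.info("hi")'
    return 'print("hi")'


case = RegressionCase(
    prompt="Add a greeting to main()",
    is_bad=lambda output: "print(" in output,
)
outcome = compare(
    case,
    fake_agent,
    old_instructions="",
    new_instructions="Never call print; use logging instead.",
)
```

With a real, non-deterministic agent you would want more than a handful of trials per side before trusting the difference between the two counts.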
What do you think would resonate with you or with the audience you're thinking about?
That repo also has an illustrative eval for an Agent Skill in Airflow for localization:
https://github.com/Alexhans/eval-ception/tree/main/exams/air...
The question I have is: what are we optimizing for and how do we measure it?
In your own repos, I see you have a fork of safepass, which seems like a nice simple project, but it doesn't have an agents file yet.