{"hookSpecificOutput":{"hookEventName":"PostToolUse","additionalContext":"[learning-opportunities-auto] The user just committed code. Per the learning-opportunities skill, consider whether this is a good moment to offer a learning exercise. If the committed work involved new files, schema changes, architectural decisions, refactors, or unfamiliar patterns, ask the user (one short sentence) if they'd like a 10-15 minute exercise. Do not start the exercise until they confirm. If they decline, note it — no more offers this session."}}
Conceptually, you should treat them as incremental software instead of magic you grab from others [1]
The killer feature is that coding harnesses tend to have SkillBuilder agent skills so creating them becomes very easy and you can evolve them.
I recommend you build your own for your particular pain points.
Very simple example [2] showing what another user mentioned around "evals" so that you can really achieve good enough correctness for your automation.
- [1] https://alexhans.github.io/posts/series/evals/building-agent...
- [2] https://alexhans.github.io/posts/series/evals/sketch-to-text...
Also, you can secure/lock down tool calls better, make the agent's tasks retryable, give it failure modes, etc. (Now if your laptop dies during agent work, only God and the agent know what happened to your code... oh no, wait: the agent just needs to spend 100k tokens to remember where it was, a great way to spend your money.)
https://github.com/anthropics/claude-code/blob/main/plugins/...
This frontend design skill that claude uses basically just begs it to pick nice fonts and make the design coherent. No specifics about which fonts or how to make nice color schemes and layout.
> When you complete architectural work (new files, schema changes, refactors), Claude offers optional 10-15 minute learning exercises grounded in evidence-based learning science. The exercises use techniques like prediction, generation, retrieval practice, and spaced repetition to provide you with semi-worked examples from across your own project work.
Confusing name though.
To solve this, I've built an agent-native tool to run evaluations based on merged PRs in your codebase. Basically you can ask Claude to evaluate whether the skill made things better/worse on real tasks, and to then iteratively improve it
Stalking your profile (sorry..) I see you're pretty deep in the eval space, so I'm super curious what your approach has been to being rigorous for things like skill changes?
Examples?
> Generation effect: Accepting generated code and decreasing generating one's own code can skip the active processing that builds understanding.
Holy truth.
Sometimes I have had sessions in which I blindly accepted the code produced by the agent for two hours, but afterwards was not able to create a new context file, having forgotten how my codebase worked. Such skill debt does not appear in the diff – it becomes apparent in situations when you must guide the agent, but cannot do it. Such is the nature of the practice proposed by this skill.
So, if I write my API endpoints a certain way, the skill would describe that specific process. Later, an agent can "see" this skill, load it when it's relevant to current chat context, and then do whatever is instructed.
Similar to "tool calls," but instead of being a function you can call, it's just instructions for how to perform that "skill."
At least for the agent I use (Cline), you can define skills either globally or locally (project level).
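A minimal sketch of the "see the skill, load it when relevant" idea described above. Everything here is hypothetical: the `Skill` fields, the sample skills, and especially the keyword matching, which stands in for the model's own judgment about relevance. It is not how Cline or any specific harness implements skills.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # short summary the agent always sees
    body: str         # full instructions, loaded only on demand


def _words(text):
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def relevant_skills(skills, message):
    """Return skills whose description shares a keyword with the message.

    Keyword overlap is a stand-in (assumption); a real harness lets the
    model decide which skill applies.
    """
    hits = []
    for skill in skills:
        shared = _words(message) & _words(skill.description)
        # Ignore very short words so "to" or "in" do not trigger a load.
        if any(len(w) > 3 for w in shared):
            hits.append(skill)
    return hits


def build_context(skills, message):
    """Always list names and descriptions; append bodies only for hits."""
    lines = [f"- {s.name}: {s.description}" for s in skills]
    for skill in relevant_skills(skills, message):
        lines.append(f"\n[{skill.name}]\n{skill.body}")
    return "\n".join(lines)


# Hypothetical skills for illustration.
SKILLS = [
    Skill("api-endpoints", "How to write API endpoints in this project",
          "1. Add a route module. 2. Validate input. 3. Return typed errors."),
    Skill("db-migrations", "How to create database migrations",
          "1. Generate a migration file. 2. Never edit applied migrations."),
]
```

Calling `build_context(SKILLS, "Add two new endpoints for billing")` lists both skills by name and description but includes the full instructions only for `api-endpoints`, which is the point: the cheap summary is always present, the expensive body only when needed.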
I've heard here that skill loads can have a separate impact on the context, like staying past a compact.
If you load a bunch of skills, your session might end up with them permanently loaded.
I think they pair well with subagents, since the subagent can load the skill and, once it's done with the work, present just the results, so the orchestrator agent doesn't need to know about it.
So the only way I can see what this skill actually looks like is to download and run it myself? No thank you.
https://github.com/DrCatHicks/learning-opportunities/blob/ma...
As I understand, this skill is intended to understand AI-generated code and potentially reduce skill atrophy. So it asks the agent to pause after important milestones (eg: created a file, changed db schema etc ) and ask the user questions about how they would do it.
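For a sense of how a "pause after a commit" nudge could be produced, here is a rough sketch of a hook script that emits JSON in the same shape as the hook output quoted at the top of this thread. The input field names (`tool_name`, `tool_input.command`) and the `git commit` substring check are assumptions for illustration, and the `additionalContext` text is abbreviated. This is not the actual script from the learning-opportunities repo.

```python
import json


def hook_output(event):
    """Return the hook's JSON reply for one event, or None to stay silent."""
    # Assumption: the event carries the tool name and, for shell calls,
    # the command string under tool_input.command.
    if event.get("tool_name") != "Bash":
        return None
    command = event.get("tool_input", {}).get("command", "")
    if "git commit" not in command:
        return None
    return {
        "hookSpecificOutput": {
            "hookEventName": "PostToolUse",
            # Abbreviated version of the quoted context text.
            "additionalContext": (
                "[learning-opportunities-auto] The user just committed code. "
                "Consider offering a short learning exercise."
            ),
        }
    }


def main(stdin, stdout):
    """Read one event as JSON and write a reply only when one applies."""
    reply = hook_output(json.load(stdin))
    if reply is not None:
        json.dump(reply, stdout)
```

A harness would run something like `main(sys.stdin, sys.stdout)` after each tool call; the skill text itself still decides whether an exercise is actually worth offering.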
For me, the main lesson here is seeing and learning from how others are using skills. Yesterday I was watching a Matt Pocock class on using agents and he was also showing off skills, such as how he uses a "grill-me" skill to develop product requirement document. I am certainly not going to do exactly what he does, but I now have my own ideas about how to develop requirements and implement them.
After all, in the words of Anthropic engineers themselves, Claude is like a talented engineer, but lacks expertise. Skills are folders and files that build expertise. Another important thing I learned from Pocock is that the longer the context (or token size), the dumber the responses tend to get. So skills are another way to present the problem to an LLM in a compact manner and get an optimized response.
Claude also has behavioral traits. So if someone iteratively builds a skill, it is most likely not going to port well to another user, because each of us chat differently. This is why I hesitate to share my skill folder with my colleagues. But I will certainly demo what I built so that they can see what's possible and figure out their own workflows.
So the value is in seeing how someone else builds using Claude, and imitating it in your own way. Very much like when I first learned programming: I was copying code from Kernighan and Ritchie's C book, but then changing things up to understand how it works and later customizing the code for my purpose.
I mentioned behavioral traits for another reason: the author is a psychologist and it is really interesting to see how she interacts with Claude, which is probably very different from how programmers use Claude. Tangentially, she (and a host of other experts in the field) left Twitter a long time ago. I'm going to install bsky/mastodon and follow them, because I think it's important to watch how expert non-programmers are using LLMs.
https://github.com/SimHacker/moollm/blob/main/skills/skill/S...
I want to learn Java Spring, and probably let AI help me / quiz me. I will take a look at the skills for inspiration.
If you want to learn how the Spring Framework and Spring Boot work, the best thing to do is build your own library and then learn to add it to a new Spring Boot service.
https://www.baeldung.com/spring-boot-custom-starter
Depending on which AI tool you are using, you can also get it to debrief what it is doing and what layer of the Spring architecture it is using (lifecycle, bean scope, is it using auth/messaging/data middleware etc)
Also here is a service I have built with Claude code along with a sample Spring boot service
https://github.com/tomaytotomato/spring-data-solr-lazarus
It is a demo to show that I could get Apache Solr working in the latest version of Spring Framework 7 and Spring Boot 4. There is a sample application in there for a bookstore you can play around with.
Current plan is to use an existing Vue/TypeScript browser game as frontend, send high scores and similar via WebSockets. Do ~something~ with Redpanda to dip my toes into the Kafka world.
I know I sometimes get demotivated mid-way, but that also tells me it might not be worth the investment
I still don't see why AI would be mandatory. It's helpful, yes, but not mandatory.
I want to make a Spring app, but instead of looking everything up on Google, I can ask the AI with context and maybe have it give me a learning plan that fits my needs.
IMHO, if you're working on large feature changes, before nudging the agent to write any code, it's best to:
1. establish consensus, just in the chat, on the problem domain — i.e. the business-domain problem you're solving (as if the agent is your contact at a software-development contractor, and you're sitting down to pin down what you want from them)
2. co-write with the agent a hierarchical-bullet-pointed design document (this should be an actual .md file, not just in the chat) — letting the agent generate + edit most of this, but nitpicking it thoroughly for problems and decision-vagueness, forcing all design-level decisions to be made up-front here
3. tell it to translate the design spec into a skeleton for a BDD spec test suite, to be populated as it implements
4. set the agent free to actually do the impl, where the agent is free to add/modify/delete unit tests and integration tests and so on, but where it must keep the design-spec file and the structure of the derived BDD spec tests fixed (and, before considering itself done, ensure that A. the BDD spec tests are all fleshed out with proper logic reflecting their labels, and B. they all pass.)
5. At this point you might be done. But if your project is absolutely huge, you might do another "sprint" at this point, starting again from the top by defining new business requirements, amending the design, getting the agent to add to the BDD suite, etc. (Or, if you want to talk everything out up front, you'd insert a step between 2 and 3 of "breaking the design down into milestones", where the agent will only create BDD spec items for the current milestone, solve for them, get approval, and then move on to the next milestone.)
Yes, I'm basically saying you should do waterfall with LLMs. Waterfall can actually be rather pleasant, when the whole process happens over the course of an hour.
And the key point here, for understanding: after the project (or after each milestone for a large project), you can have the agent walk you through the code it wrote, explaining it to you in the chat — with the constraint that it shouldn't bother to explain anything already "implied" by the design.
You can then have it turn this explanation of "the surprising parts" into code comments — and the resulting comments would actually be of the kind humans would write, rather than being pro-forma garbage!
While building https://www.agentkanban.io (a GitHub Copilot integrated task board), I experimented a lot with instruction placement. A single degree of separation from AGENTS.md works really well (I needed a robust means of having the agent pick up task-specific IDs and so settled on a file called INSTRUCTION.md in a file managed by the tool, which avoids polluting AGENTS.md as much as possible). I experimented with skills, but they were skipped too often for the tool to work as reliably as it now does.
1. Less is better. A project rarely needs more than a few skills. A skill is best when the output is measurable and clearly defined. The size of the skill is also very important, since shorter ones are easier to actively maintain and for the agent to reliably follow.
2. Context is important. I keep a short knowledge map in my AGENTS.md file, which gives the agent the context it needs for the overall workflow.
3. Frontmatter works surprisingly well. It pairs nicely with agents and has given me good results (though this might be somewhat of a byproduct).
4. Consistency matters. All skills should follow the same format. For example, I strip all Markdown formatting and enforce a very specific format on them. If you import a skill, do change its format to match yours.
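A small sketch of how the points above could be checked mechanically. The required frontmatter fields (`name`, `description`), the line budget, and the Markdown check are assumptions chosen for illustration, not the commenter's actual format. The frontmatter parser is deliberately naive (flat `key: value` pairs only) to avoid a YAML dependency.

```python
MAX_BODY_LINES = 60  # hypothetical budget; pick what your agent follows reliably


def parse_skill(text):
    """Split a skill file into (frontmatter dict, body lines)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, lines
    try:
        end = lines.index("---", 1)  # closing frontmatter delimiter
    except ValueError:
        return {}, lines
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, lines[end + 1:]


def lint_skill(text):
    """Return a list of problems; an empty list means the skill passes."""
    meta, body = parse_skill(text)
    problems = []
    for field in ("name", "description"):
        if not meta.get(field):
            problems.append(f"missing frontmatter field: {field}")
    if len(body) > MAX_BODY_LINES:
        problems.append(f"body has {len(body)} lines, limit is {MAX_BODY_LINES}")
    if any(line.lstrip().startswith("#") or "**" in line for line in body):
        problems.append("body contains Markdown headings or bold markers")
    return problems
```

Running `lint_skill` over every file under a skills directory before committing is one way to keep imported skills from drifting away from the house format.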
I would also go and say not to mistake skills for prompts, but that depends on what you deem the ideal workflow.
I also have an .agents/rules/init.md with the following prompt:
"At the start of every chat or task, you MUST read the following file:
- [AGENTS.md](@AGENTS.md)"
Most harnesses find this automatically, and I just give the file to those that don't.
Overall, I’ve found that a project usually only needs the AGENTS.md file and an .agents directory (prompts/, rules/, skills/).
I would love to hear other opinions on the things I just said.
The project in question: https://codeberg.org/hydrafog/kanban (agent-first task manager for the terminal)
e.g.:
/go
This is an interesting skill plugin for me because I actually face this inverse problem a fair amount where you want to teach people about a repo and the skills associated with it so they understand the intent behind things quickly. Seeing a bunch of skill commands and behaviors doesn’t always make clear why things are the way they are. The people on the other end need context, and the rapidity with which you can create fairly complex stuff means you need a faster way than “three months of onboarding” to get people up to speed.
That's fair but I think this is similar to power tools like vim, obsidian or others. There's the path of grabbing other people's workflows and not being able to modify them to really tailor the tool to your needs, and there's the minimal incremental path that empowers you and gives you control all the way through. It gets you to understand the tools and you'll be able to think of possibilities that match your exact problems.
I'm not dogmatic about it but I do really recommend it. You can see the transformative shift once people start "skill building" instead of "skill consuming".
Edit: The approach I mention works with non engineers/developers. So there's no different technical bar.
I looked superficially at your site/repo and based on that initial impression:
- Your approach of comparing different parts of the "black box" which affect agent behaviour (harness, foundation model, skills, context (in your case the context loaded from AGENTS.md)) is closely aligned with how I both think and operate.
- You're both tackling the "regression" and the "answer hypothesis easily" problems.
> Stalking your profile (sorry..) I see you're pretty deep in the eval space, so I'm super curious what your approach has been to being rigorous for things like skill changes?
It depends on the level of automation and risk profile. For skills I use this framework of thinking [1] and encourage evals/ground truth as soon as possible so that you can have automatic feedback loops for the markdown part and for the deterministic part (scripts). Once you have the eval/ground truth pair, you're almost doing TDD or Eval Driven Development (which is quite hard the first times you try and realize you actually need to think about intent). The scripts should definitely have their own unit tests for the "skill iteration" in the event that a mutation is desired to cover new behaviour/fix wrong behaviour.
On Agent Skills, it may seem tempting to want more "openness" for the AI to solve the problem creatively but, more often than not, you've described a repeatable workflow and you want predictability and stability instead of novelty, so it's really about: 1) How can I freeze it to keep being good enough as much as possible? 2) How can I know if something happened somewhere which changed the black box (e.g. coding harness auto model picking screws things up)? 3) How can I make the skill itself ETC (Easy to change), to keep control? Local models can be a great tool for stability in some scenarios.
In particular, I prefer pass/fail (binary) outcomes instead of scoring which doesn't help regression decisions. Defining "good enough" should be very clear. Flakiness is not a good thing to accept, if the outcomes are consequential.
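A minimal sketch of what binary, ground-truth-based evals for a skill might look like. The `Case` structure, the repeated attempts (so that a flaky case counts as a failure), and the all-cases-must-pass gate are my assumptions about one way to encode "good enough"; `fake_skill` is a stand-in for actually invoking the agent with the skill loaded.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # ground truth as a yes/no predicate


def run_evals(cases, run_skill, attempts=3):
    """Return {case name: passed}; a case passes only if every attempt does."""
    results = {}
    for case in cases:
        results[case.name] = all(
            case.check(run_skill(case.prompt)) for _ in range(attempts)
        )
    return results


def gate(results):
    """Binary decision: any failing case blocks the skill change."""
    return all(results.values())


def fake_skill(prompt):
    # Stand-in (assumption) for running the agent with the skill loaded.
    return prompt.upper()


# Hypothetical cases for illustration.
CASES = [
    Case("uppercases", "hello", lambda out: out == "HELLO"),
    Case("keeps digits", "abc123", lambda out: out.endswith("123")),
]
```

With this shape, a regression decision is a single boolean from `gate(run_evals(CASES, fake_skill))` rather than a score someone has to interpret, and an output that is right only some of the time is treated the same as one that is wrong.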
Anything actually risky should be solid RBAC/policy which doesn't really depend on the LLM.
I had a site that I didn't manage to make visible in HN to create a community for ai-evals.io. I've since interacted with a few people, developed further insights and given some private talks but need to get back to publishing outfacing and trying to contact more people interested in this space because it's absolutely critical. There's a lot of nuance in how different environments think about the eval problem differently: It's all about tracing and course correcting after launch, it's about simulations, sandboxing, security, automatic eval generation, etc.
In any case, I'll try to be more present from now on, and especially from June onwards to try to exchange insights in the open with people who are exploring different solutions in this space.
[1] - https://alexhans.github.io/posts/series/evals/building-agent...
Tailor things to make them your own. First, imagine the kind of workflow you want. Take a pen and a paper and map it out. Second, decide on a format for your agent-related files. Third, offer the prior information to an agent to create a few general iterations. Read them and ask yourself something along the lines of: "Will cutting this affect the workflow?". If not, cut it out.
Instead of jumping and copying the first thing you find, gather knowledge by reading different workflows first. Don't limit yourself to a single field, either. For example, learning about Bloom's Taxonomy can change how you view certain things, so expand your horizons.
Take small steps. It takes a lot of experimentation to reach a good enough workflow.
Everything can be an entry point and it's often non-obvious how things are structured.
More opinionated frameworks which enforce routes and consumers to be centrally managed are generally easier to figure out from the filesystem.
But if you've got an IDE like IntelliJ you get the entry point tool which lists all endpoints. Consumers are more annoying...