Hey HN, I built Cobalt, an open-source testing framework for AI agents and LLM apps. Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.
Write experiments as code:
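To give a flavor of the idea, here is a minimal sketch in plain TypeScript. It is not Cobalt's actual API: every name in it (`runExperiment`, `exitCodeFor`, `Evaluator`, the `contains-expected` scorer, the stub agent) is hypothetical and chosen only for illustration. It shows the general pattern of a dataset, an agent, an evaluator, and a threshold that maps to an exit code.

```typescript
// Hypothetical types for illustration only; none of these names come from Cobalt.
type Example = { input: string; expected: string };
type Agent = (input: string) => string; // a real agent would usually be async
type Evaluator = {
  name: string;
  score: (output: string, example: Example) => number; // 0..1
};

// Run each example through the agent once, then average every evaluator's score.
function runExperiment(
  dataset: Example[],
  agent: Agent,
  evaluators: Evaluator[]
): Record<string, number> {
  const outputs = dataset.map((example) => agent(example.input));
  const averages: Record<string, number> = {};
  for (const evaluator of evaluators) {
    let total = 0;
    for (let i = 0; i < dataset.length; i++) {
      total += evaluator.score(outputs[i], dataset[i]);
    }
    averages[evaluator.name] = dataset.length > 0 ? total / dataset.length : 0;
  }
  return averages;
}

// Map scores to an exit code: 0 if every threshold is met, 1 otherwise.
function exitCodeFor(
  scores: Record<string, number>,
  thresholds: Record<string, number>
): number {
  for (const name of Object.keys(thresholds)) {
    const score = name in scores ? scores[name] : 0; // a missing score counts as 0
    if (score < thresholds[name]) {
      return 1;
    }
  }
  return 0;
}

// Tiny in-memory dataset and a stub agent that just echoes the question.
const dataset: Example[] = [
  { input: "reset password", expected: "reset" },
  { input: "refund order", expected: "refund" },
];
const echoAgent: Agent = (input) => `You asked about: ${input}`;
const containsExpected: Evaluator = {
  name: "contains-expected",
  score: (output, example) => (output.indexOf(example.expected) !== -1 ? 1 : 0),
};

const scores = runExperiment(dataset, echoAgent, [containsExpected]);
const exitCode = exitCodeFor(scores, { "contains-expected": 0.9 });
```

In a CI script you would pass `exitCode` to `process.exit`, so any score below its threshold fails the build.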
`npx cobalt run --ci` exits with code 1 if thresholds are violated. The GitHub Action posts score tables on PRs and auto-compares against the base branch.

The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. It turns eval from a chore into a conversation.

Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results are stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.