Show HN: Cobalt – Unit tests for AI agents, like Jest but for LLMs

3 points by fdefitte 132 days ago | 0 comments

Hey HN, I built Cobalt, an open-source testing framework for AI agents and LLM apps.

Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.

  npm install @basalt-ai/cobalt
  npx cobalt init
  npx cobalt run

Write experiments as code:

  import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

  const dataset = Dataset.fromLangfuse('support-tickets')

  experiment('support-agent', dataset, async ({ item }) => {
    const result = await myAgent(item.input)
    return { output: result }
  }, {
    evaluators: [
      new Evaluator({ name: 'Helpful', type: 'llm-judge', prompt: 'Is this response helpful and accurate? {{output}}' }),
      new Evaluator({ name: 'No hallucination', type: 'llm-judge', prompt: 'Does this contain fabricated info? {{output}}' }),
    ]
  })

`npx cobalt run --ci` exits with code 1 if any threshold is violated. The GitHub Action posts score tables on PRs and automatically compares results against the base branch.
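To make the "fail the build when quality drops" idea concrete, here is a rough sketch of what a threshold gate can look like. This is not Cobalt's actual implementation: the `Scores` and `Thresholds` shapes, the function names, and the choice of a per-evaluator mean are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration only; not Cobalt's real types.
type Scores = Record<string, number[]>;   // evaluator name -> per-item scores in [0, 1]
type Thresholds = Record<string, number>; // evaluator name -> minimum acceptable mean

// Arithmetic mean; an empty list counts as 0 so a missing evaluator fails the gate.
function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Returns the names of evaluators whose mean score is below their threshold.
function findViolations(scores: Scores, thresholds: Thresholds): string[] {
  return Object.entries(thresholds)
    .filter(([name, min]) => mean(scores[name] ?? []) < min)
    .map(([name]) => name);
}

// Maps violations to an exit code: 0 passes the build, 1 fails it.
function exitCodeFor(scores: Scores, thresholds: Thresholds): number {
  return findViolations(scores, thresholds).length > 0 ? 1 : 0;
}

// Made-up scores for the two evaluators from the example above.
const scores: Scores = {
  "Helpful": [0.9, 0.8, 1.0],
  "No hallucination": [1.0, 0.5, 0.6],
};
const thresholds: Thresholds = { "Helpful": 0.8, "No hallucination": 0.8 };

const violations = findViolations(scores, thresholds);
const code = exitCodeFor(scores, thresholds);
console.log(violations, code);
```

In a CI script you would hand `code` to the process (for example via `process.exitCode` in Node) so the pipeline step fails whenever at least one evaluator is below its threshold.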

The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. Turns eval from a chore into a conversation.

Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results are stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.
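For anyone unfamiliar with JSONL: it is one JSON object per line. The sketch below shows roughly how such a file could be turned into dataset items. The `DatasetItem` shape and `parseJsonl` helper are hypothetical and are not part of Cobalt's API; its real loader and schema may differ.

```typescript
// Hypothetical item shape; Cobalt's real dataset schema may differ.
interface DatasetItem {
  input: string;
  expected?: string;
}

// Parses JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): DatasetItem[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines
    .map((line, i) => {
      const value: unknown = JSON.parse(line);
      // Require an object with a string "input" field.
      if (
        typeof value !== "object" ||
        value === null ||
        typeof (value as { input?: unknown }).input !== "string"
      ) {
        throw new Error(`Record ${i + 1} is missing a string "input" field`);
      }
      return value as DatasetItem;
    });
}

// Made-up sample data, including a blank line that should be ignored.
const sample = [
  '{"input": "My invoice is wrong", "expected": "Open a billing ticket"}',
  "",
  '{"input": "How do I reset my password?"}',
].join("\n");

const items = parseJsonl(sample);
console.log(items.length);
```

Validating each record up front means a malformed dataset fails fast, before any model calls are made.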
