Show HN: Scorecard – Evaluate LLMs like Waymo simulates cars (docs.scorecard.io)

Hey HN! I built self-driving sim and eval at Waymo. Now I'm building Scorecard to bring that approach to agent eval: reproducible, automated scoring for AI.

Scorecard lets you:

- Run LLM-as-judge evals on agent workflows: test tool usage, multi-step reasoning, and task completion in CI/CD or in a playground.
- Debug failures with OpenTelemetry traces: see which tool failed, why your agent looped, and where reasoning went wrong.
- Collaborate on datasets, simulated agents, and evaluation metrics.

Try it out → https://app.scorecard.io (free tier, no payment required!)
Docs → https://docs.scorecard.io

We're a small team (4 people), just raised $3.75M, and have early customers using Scorecard for evals in the legal-tech space.

We're on a mission to squash non-deterministic bugs. What's the weirdest LLM output you've seen?