I built FC-Eval to have a repeatable way to evaluate how well different LLMs handle function calling before using them in agent workflows.

It runs models through 30 test cases covering single-turn, multi-turn, and agentic scenarios, modeled loosely after the Berkeley Function Calling Leaderboard methodology. Validation uses AST matching rather than string comparison, so formatting variations in an otherwise correct call are not counted as failures.

It supports two backends: OpenRouter for cloud models (GPT-5.2, Claude, Qwen 3.5, Mistral, etc.) and Ollama for local models, with no API key needed. It can run each test for best-of-N trials, giving you a reliable score alongside raw accuracy. Results export to JSON, TXT, CSV, or Markdown.

Quick start commands:

Via OpenRouter: `fc-eval --provider openrouter --models openai/gpt-5.2 anthropic/claude-sonnet-4.6`

Via Ollama: `fc-eval --provider ollama --models llama3.2`

GitHub repo: https://github.com/gauravvij/function-calling-cli

Happy to answer questions, especially around the test case design or validation logic.
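For anyone wondering what AST matching buys you over string comparison, here is a minimal sketch of the general idea. It is not FC-Eval's actual validation code: the helper names `parse_call` and `calls_match` are hypothetical, and it assumes the model's call is available as a Python-style call string with keyword arguments only.

```python
import ast


def parse_call(text: str):
    """Parse a Python-style call string into (function_name, kwargs).

    Hypothetical helper for illustration; assumes keyword arguments only.
    Requires Python 3.9+ for ast.unparse.
    """
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    name = ast.unparse(node.func)  # also covers dotted names like weather.get
    kwargs = {}
    for kw in node.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs unpacking is not handled in this sketch")
        # literal_eval only accepts literals, so arbitrary code is rejected
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return name, kwargs


def calls_match(expected: str, actual: str) -> bool:
    """Compare two calls structurally, ignoring spacing, quoting, and arg order."""
    try:
        return parse_call(expected) == parse_call(actual)
    except (ValueError, SyntaxError):
        # Unparseable or unsupported output counts as a mismatch
        return False


expected = 'get_weather(city="Paris", unit="celsius")'
actual = "get_weather( unit='celsius',city='Paris' )"
print(expected == actual)             # plain string comparison
print(calls_match(expected, actual))  # structural comparison
```

The two calls differ in quoting, spacing, and argument order, so a plain string comparison rejects them while the structural comparison accepts them. Note that this sketch inherits Python's equality rules (for example, `1 == 1.0`), so a stricter checker would also compare argument types.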