I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases.

What it does:

Parallel Execution: Send a single prompt to OpenAI, Anthropic, Mistral, and Gemini simultaneously. Compare the outputs, latency, and exact token usage side-by-side.

Batch Evaluations: Upload a CSV dataset to run bulk tests across multiple models at once.

Manual Diagnostics: Grade outputs manually (Pass/Fail) and assign diagnostic tags (e.g., Hallucination, Format Error) to build a human-verified performance leaderboard.

Local-first: API keys encrypted with your OS keyring; history stored in a local SQLite DB; no telemetry.

I’m looking for technical feedback. What do you think current LLM testing/evaluation tools get most wrong?
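
For anyone curious what the parallel execution item amounts to, here is a minimal sketch of the fan-out pattern in Python. It is not AptSelect's actual code: `make_stub`, `timed_call`, and `fan_out` are hypothetical names, and the providers are local stubs standing in for real API clients, so no network calls or keys are involved.

```python
import asyncio
import time


def make_stub(name, delay, fail=False):
    """Build a fake provider coroutine (hypothetical stand-in for a real API client)."""
    async def call(prompt):
        await asyncio.sleep(delay)  # simulate network latency
        if fail:
            raise RuntimeError(f"{name} unavailable")
        return f"[{name}] echo: {prompt}"
    return call


async def timed_call(name, call, prompt):
    """Run one provider call, recording latency and capturing any error."""
    start = time.perf_counter()
    try:
        output, error = await call(prompt), None
    except Exception as exc:  # one failing provider should not sink the batch
        output, error = None, repr(exc)
    return {
        "provider": name,
        "output": output,
        "error": error,
        "latency_s": time.perf_counter() - start,
    }


async def fan_out(prompt, providers):
    """Send the same prompt to every provider concurrently."""
    tasks = [timed_call(name, call, prompt) for name, call in providers.items()]
    # gather returns results in the same order as the tasks were passed in
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    providers = {
        "stub-a": make_stub("stub-a", 0.05),
        "stub-b": make_stub("stub-b", 0.10),
        "stub-c": make_stub("stub-c", 0.02, fail=True),
    }
    for row in asyncio.run(fan_out("Reply with the word OK.", providers)):
        print(row["provider"], row["error"] or row["output"], f"{row['latency_s']:.3f}s")
```

Errors are captured per provider so that a single timeout or outage still leaves the other columns of the comparison filled in.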