Trust at scale: Auto-evaluation for high-stakes LLM accuracy | Dark Hacker News