Measuring What Matters: Construct Validity in Large Language Model Benchmarks (arxiv.org)
1 point by Cynddl 188 days ago | 0 comments
No comments yet