Ask HN: Are LLMs getting better, and how can you tell?

There are so many benchmarks out there for evaluating models: ARC-AGI, FrontierMath, MMLU, Berkeley Function Calling, and many more. I guess the general idea behind all of them, taken together, is to “approximate” every type of problem that can be tokenized and solved by an LLM.

That said, I can’t seem to do better than just “vibes”. Basically: oh, this model gave me a good response to this question, so it must be better. I have tried keeping track of a couple of benchmarks like the ones above, but I generally can’t translate their scores into utility outside the small scope each benchmark tests for. There are also so many benchmarks to keep track of, and each takes some learning to understand.

So perhaps my scope isn’t defined well enough. But as a programmer, everything beyond GPT-4o feels pretty damn similar.

Would love to hear how others evaluate LLMs beyond “just vibes”, generally for programming use, but also when trying to create new AI projects.