Show HN: Llmfao – Human-Ranked LLM Leaderboard with Sixty Models(dustalov.github.io) In September 2023, I noticed a tweet [1] on difficulties with LLM evaluation, which resonated with me a lot. A bit later, I spotted a nice LLMonitor Benchmarks dataset [2] with a small set of prompts and a large set of model completions. I decided to make my attempt without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/ I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking [1]: https://twitter.com/_jasonwei/status/1707104739346043143 [2]: https://benchmarks.llmonitor.com/ Unfortunately, I did my analysis before the Mistral AI model was released, but published it after the model was released. I’d be happy to add it to the comparison if I had their completions. |