LLM price vs. performance (Google sheet) (docs.google.com)
All values are sourced externally from publicly available data.
This sheet is only as good as the data I've found for it. Some values change over time (e.g., the 0-100 normalized index), while others come from sources that contradict each other. For example, OpenAI's self-reported metrics for GPT-4-turbo are close but not identical between their simple-evals repo[1] and the charts in the GPT-4o announcement[2]. In other cases, strong benchmark scores are featured prominently on marketing pages while weaker scores take some digging to find.
As a general rule of thumb, I've tried to: a) include every metric I can find, to help mitigate cherry-picking bias; b) resolve conflicts by selecting what I consider to be either the more current or the more trustworthy source. For what it's worth, I haven't come across any discrepancies between sources that were large enough to matter.
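For anyone who wants to apply something like rule (b) to their own data, here is a rough Python sketch. It is not how the sheet was actually built (that was done by hand and by judgment); the `Report` type, the `trust` ranking, and the `resolve` helper are all hypothetical names, and the scores are placeholders rather than real benchmark values. It assumes trust is compared first and recency only breaks ties, which is just one possible reading of the rule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Report:
    """One reported score for a model on a benchmark (hypothetical structure)."""
    source: str       # where the number was found
    score: float      # placeholder value, not a real benchmark result
    published: date   # when the source published it
    trust: int        # hand-assigned rank, higher means more trustworthy


def resolve(reports: list[Report]) -> Report:
    """Pick one report: highest trust wins, most recent date breaks ties."""
    if not reports:
        raise ValueError("need at least one report")
    return max(reports, key=lambda r: (r.trust, r.published))


# Two sources disagree slightly; equal trust, so the newer one is kept.
conflicting = [
    Report("source_a", 80.0, date(2024, 1, 1), trust=1),
    Report("source_b", 81.0, date(2024, 5, 1), trust=1),
]
chosen = resolve(conflicting)
print(chosen.source, chosen.score)
```

Swapping the order of the tuple in `key` would prefer recency over trust instead, which may fit better when sources are all roughly equally reliable.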
The folks I've shared this with so far have found it useful - I hope you do as well!
[1] https://github.com/openai/simple-evals
[2] https://openai.com/index/hello-gpt-4o/