Hi everyone,
I recently conducted a test to compare the performance of 12 different AI models. Using a custom-made 65-question exam, I evaluated each model's accuracy and compiled the results. I thought it would be interesting to share the findings here and get your thoughts on them.
I was really surprised by some of these results. All the models did quite well, considering the questions were pretty challenging. Even the lowest score was still 87.69%. It was fascinating to see open-source models being very competitive with the best models available.
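For anyone who wants to map the percentages back to raw scores, here is a quick Python sketch. It assumes every one of the 65 questions counts equally (one point each), and the function names are purely illustrative. Under that assumption, 87.69% works out to 57 of 65 correct, and a single question is worth about 1.54 percentage points, so models that are a point or two apart differ by only one question.

```python
TOTAL_QUESTIONS = 65  # exam length stated in the post


def accuracy_percent(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Return the share of correct answers as a percentage, rounded to 2 places."""
    return round(100 * correct / total, 2)


def correct_from_percent(percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Recover the number of correct answers from a reported percentage.

    Assumes equal weighting: one point per question.
    """
    return round(percent * total / 100)


# The lowest reported score, 87.69%, as a raw count
print(correct_from_percent(87.69))        # 57
print(accuracy_percent(57))               # 87.69

# How much a single question moves the score, in percentage points
print(round(100 / TOTAL_QUESTIONS, 2))    # 1.54
```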
I bought the lifetime deal on chatarena.ai to have an interface to quickly test all of them.
Also, I tested Perplexity Pro, but it performed so poorly that I didn't include it. Maybe the 65-question test was too long for it? Usually, it gives me pretty good results, so I'm not sure what happened there.
For general personal use outside this test, GPT-4, GPT-4o, and Gemini 1.5 are currently my favorites. I'm still not sure which is better between GPT-4 and GPT-4o, as their outputs tend to be quite similar. I used to be really disappointed with Gemini, but I think they've improved it a lot recently. On the other hand, I'm more disappointed with Claude, especially Opus. Yes, it did rank 2nd here, but in my personal use it constantly messes up on things that GPT and Gemini never get wrong. I let my friends use my subscriptions sometimes, and they agree that GPT and Gemini have been outperforming Opus.
Please share your insights and thoughts on these results!