Top AI models fail at >96% of tasks (zdnet.com)
Models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given the performance on other micro-benchmarks, they will probably not be much different on this benchmark.
One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." I can't imagine how Opus 4.5 could've failed that.
It takes a lot just to be mediocre. Don't get me wrong, I'll agree that current ML is mediocre; it's just that "mediocre" is an incomprehensibly huge step up from "random".
Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051