They definitely must be doing some quantization or optimization to meet demand, otherwise why would model performance degrade this much? It's been crazy for me personally
Combining multiple tests on the same leaderboard like this is nonsense, there should be a separate leaderbaord for the new tasks where every model is tested again.
Putting it on the original leaderboard as "Opus 4.6 (April 12)" is so obviously inappropriate that it smells like deception. You could say that the leaderboard is hallucinated.