Next Grok model training with 10T parameter model (twitter.com)
Between MoE, aggressive quantization, and synthetic data pipelines, it’s getting harder to tell whether bigger models are actually better or just more expensive to train.
It would be more interesting to see capability per dollar or per watt, not parameter count.