The Economics of Speculative Decoding (fergusfinn.com)
Whether this is true depends on what you mean by small. In general, AIUI, you don't need more than a handful of experts to get a meaningful probability of overlap. DeepSeek V4 Pro is an exceptionally sparse model, and even there you start to get meaningful overlap at a batch size of 5 or more. More generally, the expected number of distinct experts activated by a batch of size b is n(1 - (1 - k/n)^b), where k is the number of active experts per token and n is the total number of experts (this assumes routing is uniform and independent across tokens). For DeepSeek V4, k=6 and n is 256 for Flash, 384 for Pro. (The sampling is repeated per layer, not just per token.)
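To put rough numbers on that formula, here is a minimal sketch in Python. It assumes each token picks its k experts uniformly at random and independently of the other tokens, which real routers don't do, so treat the output as a ballpark only. The function name and the batch sizes are my own choices for illustration; k and n are the values quoted above.

```python
def expected_active_experts(n: int, k: int, b: int) -> float:
    """Expected number of distinct experts hit in one MoE layer by a batch of b tokens.

    Assumption (not from any model spec): each token selects k of the n experts
    uniformly at random, independently of every other token. A given expert is
    then missed by one token with probability 1 - k/n, and by all b tokens with
    probability (1 - k/n)**b.
    """
    return n * (1.0 - (1.0 - k / n) ** b)


if __name__ == "__main__":
    k = 6  # active experts per token, as quoted in the comment above
    for name, n in (("Flash", 256), ("Pro", 384)):
        for b in (1, 5, 32, 256):
            distinct = expected_active_experts(n, k, b)
            # k * b is what you'd load if no two tokens ever shared an expert
            saved = k * b - distinct
            print(f"{name:5s} n={n} b={b:3d}: "
                  f"{distinct:7.2f} distinct experts, {saved:8.2f} loads saved by overlap")
```

The "loads saved" column is the gap between k*b and the expected distinct count, i.e. how much expert-weight traffic the overlap removes per layer under these assumptions.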
Good point, though. Plus, for DeepSeek the shared expert increases the overlap slightly.