For example, some time ago Mamba-3B matched performance of Pythia-7B: https://www.reddit.com/r/singularity/comments/18asto2/announ...
The main drawback of the legacy models was that they were hard to scale, both due to slow training (sequential processing) and poor stability with depth increase (i.e. poor gradient flow). Modern implementation no longer have these problems and able to scale decently.