I think this has every chance of being an enabler for much more powerful architectures.
The depth of a transformer is its number of layers. Add a recurrent connection from the previous token's output to the current token's input, and the effective depth becomes the number of layers times the number of timesteps.
If it works as well as I imagine, it's going to make for much more powerful models.
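To put numbers on the depth claim, here is a tiny sketch. The function name and parameters are made up for illustration, and it only counts the longest chain of layer applications from the first token to the last output, nothing more.

```python
def effective_depth(num_layers: int, num_timesteps: int, recurrent: bool) -> int:
    """Longest chain of layer applications between first input and last output.

    Without recurrence every token passes through the stack once, so the
    depth is just num_layers. With an output-to-input recurrent connection,
    each timestep's stack is chained onto the previous one.
    """
    if num_layers < 1 or num_timesteps < 1:
        raise ValueError("num_layers and num_timesteps must be positive")
    return num_layers * num_timesteps if recurrent else num_layers


if __name__ == "__main__":
    # Hypothetical sizes: a 12-layer model run over 100 tokens.
    print(effective_depth(12, 100, recurrent=False))
    print(effective_depth(12, 100, recurrent=True))
```

So a modest stack unrolled over a long sequence ends up far deeper than any feed-forward transformer, which is where both the extra power and the training difficulty come from.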
RNNs are more like us: they compress all previous tokens into a small, fixed-size memory. However, we can't train them with legacy backprop through time (BPTT), because it doesn't scale and suffers from exploding/vanishing gradients.
So we discovered a 1992 zeroth-order algorithm to replace BPTT. Not only does it scale amazingly well, but in some cases it trains 19x faster than BPTT! So maybe with this, RNNs can replace transformers?
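For anyone wondering what "zeroth-order" training looks like in practice, here is a minimal sketch. It uses a generic random-direction central-difference estimator, which is not necessarily the 1992 algorithm mentioned above. The scalar RNN, the toy task, and every name and hyperparameter here are assumptions made up for illustration. The point is that each update needs only two forward passes and no backward pass through time.

```python
import math
import random


def rnn_loss(params, data):
    """Mean squared error of a scalar RNN over (inputs, target) pairs."""
    w_h, w_x, w_o = params
    total = 0.0
    for xs, target in data:
        h = 0.0  # fixed-size memory: one scalar, whatever the sequence length
        for x in xs:
            h = math.tanh(w_h * h + w_x * x)
        total += (w_o * h - target) ** 2
    return total / len(data)


def zeroth_order_step(params, data, rng, eps=1e-3, lr=0.02):
    """One parameter update from two loss evaluations (no backprop)."""
    u = [rng.choice((-1.0, 1.0)) for _ in params]  # random +/-1 direction
    plus = [p + eps * d for p, d in zip(params, u)]
    minus = [p - eps * d for p, d in zip(params, u)]
    # Central difference approximates the directional derivative along u.
    slope = (rnn_loss(plus, data) - rnn_loss(minus, data)) / (2 * eps)
    return [p - lr * slope * d for p, d in zip(params, u)]


def train(steps=400, seed=0):
    """Train on a toy task and return (initial_loss, final_loss)."""
    rng = random.Random(seed)
    data = []
    for _ in range(16):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        data.append((xs, xs[-1] + xs[-2]))  # toy target: sum of last two inputs
    params = [0.5, 0.5, 0.5]  # hypothetical initial weights
    initial = rnn_loss(params, data)
    for _ in range(steps):
        params = zeroth_order_step(params, data, rng)
    return initial, rnn_loss(params, data)


if __name__ == "__main__":
    initial, final = train()
    print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

Because the update never stores activations or multiplies Jacobians across timesteps, memory stays flat in sequence length. The trade-off is a noisier gradient estimate, so practical versions typically average over many random directions.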