Scaling RNNs to Billions of Parameters with Zero Order

Scaling RNNs to Billions of Parameters with Zero Order (arxiv.org)

7 points by fchaubard 1 year ago | 3 comments

The authors emphasize that this can make RNNs competitive with big transformers. It also means you can feed part of a transformer's output back into its input at the next step, or turn transformers into RNNs in other ways, so RNNs don't have to be all about speed.

I think this has every chance of being an enabler for much more powerful architectures.

The depth of a transformer is its number of layers. Add a recurrent connection from the previous token's output to the current input, and the effective depth becomes the number of layers times the number of timesteps.
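To put numbers on the depth argument, here is a tiny Python sketch. The function name `effective_depth` and the example sizes (12 layers, 100 timesteps) are made up for illustration; it only counts the longest chain of layer applications between the first input and the last output.

```python
def effective_depth(num_layers: int, num_timesteps: int, recurrent: bool) -> int:
    """Longest chain of layer applications from the first input to the last output."""
    if not recurrent:
        # Without output-to-input feedback, each token passes through the stack once.
        return num_layers
    # With feedback, every timestep stacks another full pass on top of the previous one.
    return num_layers * num_timesteps


print(effective_depth(12, 100, recurrent=False))
print(effective_depth(12, 100, recurrent=True))
```

With these hypothetical sizes the non-recurrent depth is 12 and the recurrent one is 12 * 100 = 1200.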

If it works as well as I imagine, it's going to make for much more powerful models.

fchaubard 1 year ago

exactly

fchaubard 1 year ago

Layman Abstract: Transformers keep all previous tokens around for every generated token, so they take up ENORMOUS GPU memory and are costly at inference. Humans do not: we page information in and out of our small, fixed-size "working memory", keeping around only the important information from the past.
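A back-of-the-envelope Python sketch of that memory gap. Everything here is an assumption for illustration: the function names, the model sizes (32 layers, width 4096, 2 bytes per value), a standard multi-head-attention key/value cache, and an RNN that keeps one state vector of the same width per layer.

```python
def kv_cache_bytes(num_layers: int, seq_len: int, d_model: int, bytes_per_value: int) -> int:
    """Memory for cached keys and values: grows linearly with the number of tokens."""
    keys_and_values = 2  # one key vector and one value vector per token per layer
    return keys_and_values * num_layers * seq_len * d_model * bytes_per_value


def rnn_state_bytes(num_layers: int, state_size: int, bytes_per_value: int) -> int:
    """Memory for a fixed-size recurrent state: independent of sequence length."""
    return num_layers * state_size * bytes_per_value


for seq_len in (1_000, 100_000):
    cache = kv_cache_bytes(num_layers=32, seq_len=seq_len, d_model=4096, bytes_per_value=2)
    state = rnn_state_bytes(num_layers=32, state_size=4096, bytes_per_value=2)
    print(f"{seq_len} tokens: cache {cache} bytes, recurrent state {state} bytes")
```

The cache figure scales with `seq_len`; the recurrent state figure stays the same no matter how many tokens have been seen.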

RNNs are more like us: they compress all previous tokens into a small, fixed-size memory. However, we can't train them with legacy backprop through time (BPTT), because it doesn't scale and suffers from exploding/vanishing gradients.

So we discovered a 1992 zero-order algorithm to replace BPTT, and not only does it scale amazingly well, in some cases it trains 19x faster than BPTT! So maybe with this, RNNs can replace transformers?
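For readers unfamiliar with the term, here is a rough sketch of what "zero order" training looks like, under loud assumptions: it is a generic two-point random-perturbation gradient estimate on a toy NumPy RNN, not necessarily the 1992 algorithm or the paper's method. The names `rnn_loss` and `zero_order_grad`, the model sizes, the step size, and the toy task are all hypothetical. The point is that the update needs only forward passes, with no backprop through time.

```python
import numpy as np


def rnn_loss(params, inputs, targets, hidden_size):
    """Run a tiny tanh RNN over the sequence and return the mean squared error."""
    input_size = inputs.shape[1]
    n_hh = hidden_size * hidden_size
    n_xh = input_size * hidden_size
    # Unpack the flat parameter vector into recurrent, input, and readout weights.
    w_hh = params[:n_hh].reshape(hidden_size, hidden_size)
    w_xh = params[n_hh:n_hh + n_xh].reshape(input_size, hidden_size)
    w_out = params[n_hh + n_xh:]
    h = np.zeros(hidden_size)
    loss = 0.0
    for x, y in zip(inputs, targets):
        h = np.tanh(h @ w_hh + x @ w_xh)  # fixed-size state, overwritten each step
        loss += (h @ w_out - y) ** 2
    return loss / len(inputs)


def zero_order_grad(loss_fn, params, rng, eps=1e-3, num_perturbations=8):
    """Estimate the gradient from loss values alone (two forward passes per direction)."""
    grad = np.zeros_like(params)
    for _ in range(num_perturbations):
        direction = rng.standard_normal(params.shape)
        delta = loss_fn(params + eps * direction) - loss_fn(params - eps * direction)
        grad += (delta / (2.0 * eps)) * direction
    return grad / num_perturbations


rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 2, 20  # hypothetical toy sizes
inputs = rng.standard_normal((seq_len, input_size))
targets = inputs.sum(axis=1)  # toy regression target, an assumption for this sketch
num_params = hidden_size * hidden_size + input_size * hidden_size + hidden_size
params = 0.1 * rng.standard_normal(num_params)


def loss_fn(p):
    return rnn_loss(p, inputs, targets, hidden_size)


initial_loss = loss_fn(params)
for _ in range(300):
    params -= 0.02 * zero_order_grad(loss_fn, params, rng)  # plain SGD on the estimate
final_loss = loss_fn(params)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The estimate is noisy, so how much the loss drops on this toy problem depends on the seed, the step size, and the number of perturbations. Memory use is that of a forward pass, since no activations are stored for a backward pass.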