Diffusion-based alternative to self-attention (github.com)
Some of the alternatives I am about to consider:
1. Diffusion with sparse attention layers.
2. Hierarchical diffusion: next-token diffusion combined with higher-order chunk diffusion.
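To make the first idea a bit more concrete, here is a minimal sketch of one possible sparsity pattern. It assumes "sparse" means a sliding window where each position only attends to neighbours within a fixed distance; that is only one of several options and not something I have settled on. The names `local_attention_mask`, `masked_softmax`, and `window` are made up for illustration and do not come from the repo.

```python
import math

def local_attention_mask(seq_len, window):
    """Build a seq_len x seq_len boolean mask.

    mask[i][j] is True when position i may attend to position j,
    i.e. when j lies within `window` steps of i (assumed pattern).
    """
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, skipping masked-out positions."""
    kept = [s for s, keep in zip(scores, mask_row) if keep]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 6 positions, each attending to itself and one neighbour per side.
mask = local_attention_mask(seq_len=6, window=1)
weights = masked_softmax([0.5, 1.0, 0.2, 0.9, 0.1, 0.3], mask[2])
```

With this pattern each position attends to at most `2 * window + 1` others, so the number of attended pairs grows linearly with sequence length instead of quadratically.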
Still figuring out the code and I would love any feedback on these approaches.