Mamba Explained: The State Space Model Taking On Transformers

Mamba Explained: The State Space Model Taking On Transformers (kolaayonrinde.com)

270 points by koayon 2 years ago | 93 comments

Straw 2 years ago |

The SSM papers and blogs always have unnecessarily complicated explanations. At this point I almost wonder if it's to hide how simple the underlying algorithms are, or to make them seem fancy.

SSMs are doing exponentially weighted moving averages (EMA). That's it: to summarize the past, you take an EMA of a variable output at each time step. Mamba changes one key thing: instead of decaying the past by a fixed amount each step as in a constant-time EMA, we have another output which decides how much to forget, or equivalently, how much 'time' has passed since the last observation in our EMA.

All of the matrix equations, continuous time, discretization, etc., will end up with a dynamic-forgetting EMA as I describe above. This also makes the benefits and limitations clear: the state size is finite, and the model has to decide at a given layer what to forget before it sees the past at that layer.
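
A minimal sketch of the EMA picture described above, in plain Python. The function names `ema_fixed` and `ema_gated` are made up for illustration, and the scalar state is a simplification: a real SSM layer keeps a vector of such states per channel and learns the decay from the input.

```python
def ema_fixed(xs, decay=0.9):
    """Classic EMA: the past is decayed by the same factor at every step."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out


def ema_gated(xs, decays):
    """Dynamic-forgetting EMA: each step supplies its own decay in [0, 1].

    A decay near 1 keeps the past and mostly ignores the new input;
    a decay near 0 forgets the past and overwrites the state.
    """
    h = 0.0
    out = []
    for x, a in zip(xs, decays):
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out


if __name__ == "__main__":
    # Step 2 uses decay 0.0 (forget everything), step 3 uses 1.0 (ignore input).
    print(ema_gated([1.0, 2.0, 3.0], [0.5, 0.0, 1.0]))
```

In this sketch the only difference between the two functions is where the decay comes from, which is the point of the comment: the state stays a fixed size, and whatever was decayed away at a step cannot be recovered later.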

logicchains 2 years ago | |

Are there any fundamental differences between Mamba, Retnet and RWKV, or are they all variants of this same architecture?

Straw 2 years ago | | |

No, all of these use the same fundamental architecture with minor tweaks, such as the dynamic gate for Mamba or an outer product parameterization of the values for RWKV-v5.

ogogmad 2 years ago | |

That might explain the motivation for why the Δ variable is used and varied; but not the "Selectivity", which the article says is expressed by how the matrices B and C vary while consuming input.

Something I've noticed is that B, C and Δ depend only on the current token. See this: https://www.kolaayonrinde.com/blog/images/mamba/ssm_algorith... -- Another thing I've noticed is that the definition of "SSM" in the image I've linked to is apparently recursive. This is also in the arXiv paper. Strange.

+1 though for making me go back to the article and read it more carefully! +1 also to the article.

ogogmad 2 years ago | | |

OK, I've noticed that the pseudo-code above is vectorised, and so there's no recursion. The SSM function is actually described at the start of the paper, and an efficient hardware-aware implementation is suggested in section 3: https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf
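
For readers who find the vectorised pseudo-code hard to follow, here is a hedged sketch of the same recurrence written as an explicit loop over tokens. It assumes a single input channel, a diagonal A, the discretization Ā = exp(ΔA), and the simplified B̄ = ΔB. The name `selective_scan` and the argument layout are illustrative only and are not taken from any library, and this loop is not the hardware-aware implementation from section 3 of the paper.

```python
import math


def selective_scan(xs, deltas, Bs, Cs, A):
    """Sequential form of a single-channel selective SSM with diagonal A.

    xs:     list of T scalar inputs
    deltas: list of T positive step sizes, one per token
    Bs, Cs: lists of T vectors of length N, one per token
    A:      list of N non-positive reals (diagonal of the state matrix)
    """
    n = len(A)
    h = [0.0] * n  # fixed-size hidden state
    ys = []
    for x, dt, B, C in zip(xs, deltas, Bs, Cs):
        for i in range(n):
            a_bar = math.exp(dt * A[i])  # per-token decay of the old state
            b_bar = dt * B[i]            # simplified discretization of B
            h[i] = a_bar * h[i] + b_bar * x
        # Read the state out through the current token's C
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys
```

Because Δ, B and C are supplied per token, each step can choose how strongly to decay the old state and how strongly to write the new input, while the loop itself only ever carries `h` forward.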

binarymax 2 years ago | |

I hadn’t heard of Mamba before reading this article, and I was wondering if anyone has tried setting the importance of a token via a TF-IDF or BM25 lookup. It requires a first pass to construct the token index, but otherwise it seems like it would address the big issue that all these architectures have - they don’t know how “important” a token is. Interestingly this seems to be the crux of Mamba - deciding what tokens to forget! EMA otherwise treats all tokens equally at sequence time. What if the tokens were weighted beforehand and the weights were passed as an attention mechanism? I wonder if anyone has tried something like this.

halflings 2 years ago | | |

The importance (e.g. attention) needs to be dynamic, e.g. one token will be important to some other tokens but not others.

tf-idf and similar heuristics are what we were using before attention came along, e.g. a tf-idf weighted bag-of-words representation of word2vec embeddings. That approach fails in so many cases.
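
To make the contrast concrete, here is a rough sketch of the older approach mentioned above: a tf-idf weighted average of word vectors. The function name, the toy embeddings, and the smoothed idf formula are assumptions chosen for illustration, not a reference implementation of any particular library.

```python
import math
from collections import Counter


def tfidf_weighted_embedding(doc, corpus, embeddings):
    """Average the word vectors of `doc`, weighting each token by tf-idf.

    doc:        list of tokens
    corpus:     list of token lists, used only for document frequencies
    embeddings: dict mapping token -> list of floats (toy word vectors)
    """
    n_docs = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))  # count each token once per document

    tf = Counter(doc)
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        if tok not in embeddings:
            continue
        idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0  # smoothed idf
        w = (count / len(doc)) * idf
        for i in range(dim):
            total[i] += w * embeddings[tok][i]
        weight_sum += w
    return [v / weight_sum for v in total] if weight_sum else total
```

Note that each weight here depends only on the token and the corpus statistics. It is the same no matter which other token is "asking", which is the static importance that the comment above contrasts with attention.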

nelsondev 2 years ago | | |

Not exactly related, but in the same vein - Deep Impact - deep learning to find term impacts in the context of their document.

https://arxiv.org/abs/2104.12016

torginus 2 years ago | |

Is this analogous to digital filters, where Transformers are the FIR filters, which operate on the history of inputs, and SSMs are the IIR filters, which take past inputs into account with an exponentially decaying importance?
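
A small sketch of the two filter types in the analogy, in plain Python. The names `fir_filter` and `iir_filter` are illustrative, and the analogy is loose: attention weights depend on the input, whereas the FIR taps here are fixed.

```python
def fir_filter(xs, taps):
    """FIR: y[t] = sum_k taps[k] * x[t-k].

    The output is computed directly from a window of raw past inputs,
    so those inputs must be kept around.
    """
    ys = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(taps):
            if t - k >= 0:
                acc += w * xs[t - k]
        ys.append(acc)
    return ys


def iir_filter(xs, a):
    """First-order IIR: y[t] = a * y[t-1] + (1 - a) * x[t].

    Only one number of state is carried forward; older inputs survive
    only through it, with exponentially decaying weight.
    """
    ys = []
    y = 0.0
    for x in xs:
        y = a * y + (1.0 - a) * x
        ys.append(y)
    return ys


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(fir_filter(impulse, [0.5, 0.5]))  # response ends after the window
    print(iir_filter(impulse, 0.5))         # response decays but never ends
```

Feeding both an impulse shows the difference in memory: the FIR response stops once the impulse leaves the window, while the IIR response keeps halving.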

CrypticShift 2 years ago |

> In other words, you can drag and drop downloaded states into your model, like literal plug-in cartridges

The same could be said of "control vectors" [1]. Both ideas are still experimental, but it seems to me, IINM, that they could replace "system prompts" and "RAG" respectively.

[1] https://news.ycombinator.com/item?id=39414532

refulgentis 2 years ago | |

Can control vectors replace RAG?

i.e. if I want the model to give me a summary of the news today, and the model was trained before today, can control vectors help?

p1esk 2 years ago | | |

No technique can get you the news other than actually searching for and then parsing the published news.

Der_Einzige 2 years ago | |

Whoever is downvoting this post needs to stop.

The concepts behind control vectors, i.e. "representation engineering" are not especially new and have been highly effective in the diffusion space. I always find it entertaining when LLM folks act like they're discovering stuff that waifu stable diffusion folks knew for 6 months + about - like "concept slider loras".

CuriouslyC 2 years ago | | |

You are right that playing with AI image generation models is really good for building intuition about AI models in general, even if they seem superficially different. It's kind of like surveying a battlefield from the air.

refulgentis 2 years ago | | |

I don't know what you mean, can you help me?

I'm familiar with our intrepid stable diffusion sailors.

I don't know why you think the post is being downvoted.

I don't know why it would be verboten to downvote it, or indicative of the downvoter being an LLM fanatic who thinks they discovered everything.

I am puzzled by the post because it claims RAG can be replaced by control vectors.

I'm also puzzled because it claims prompts can be replaced by control vectors.

I get that if system prompts were only to shift output tone, control vectors could replace that case, but that seems narrow compared to the full set of things prompt input enables (inter alia, the in-context learning)

jncfhnb 2 years ago | | |

Most of these things aren’t much better than a single weighted token though

Der_Einzige 2 years ago |

First it was Longformer and linear attention models. Then it was RWKV, and now it's Mamba. So many bombastic claims of improved architectural performance - and no open source models that beat the thing they purport to beat. The proof is always in the pudding, and these models will remain a curiosity for most until their weights are being benchmarked favorably on LLM leaderboards.

imjonse 2 years ago |

Explaining Mamba is a rite of passage, like the monad tutorials of yore.

SkyMarshal 2 years ago | |

Mamba is like a burrito...

kekebo 2 years ago | | |

It gets soggy and disintegrates when not consumed swiftly?

sja 2 years ago | |

Or Balks[0]:

BALK RULES! IMPORTANT! 1. You can’t just be up there and just doin’ a balk like that.

1a. A balk is when you

1b. Okay well listen. A balk is when you balk the

1c. Let me start over

1c-a. The pitcher is not allowed to do a motion to the, uh, batter, that prohibits the batter from doing, you know, just trying to hit the ball. You can’t do that.

1c-b. Once the pitcher is in the stretch, he can’t be over here and say to the runner, like, “I’m gonna get ya! I’m gonna tag you out! You better watch your butt!” and then just be like he didn’t even do that.

1c-b(1). Like, if you’re about to pitch and then don’t pitch, you have to still pitch. You cannot not pitch. Does that make any sense?

1c-b(2). You gotta be, throwing motion of the ball, and then, until you just throw it.

1c-b(2)-a. Okay, well, you can have the ball up here, like this, but then there’s the balk you gotta think about.

1c-b(2)-b. Fairuza Balk hasn’t been in any movies in forever. I hope she wasn’t typecast as that racist lady in American History X.

1c-b(2)-b(i). Oh wait, she was in The Waterboy too! That would be even worse.

1c-b(2)-b(ii). “get in mah bellah” – Adam Water, “The Waterboy.” Haha, classic…

1c-b(3). Okay seriously though. A balk is when the pitcher makes a movement that, as determined by, when you do a move involving the baseball and field of

2. Do not do a balk please.

[0]: https://justinbee.tumblr.com/post/15309101943/best-explanati...

hyperbovine 2 years ago | |

Similar market share too.

behnamoh 2 years ago |

Can the low adoption of Mamba be attributed to what is being discussed today on HN (https://news.ycombinator.com/item?id=39491863)?

Basically, Nvidia et al. don't want the AI research to move in a direction that requires less GPU compute, less training data, and less inference compute.

Someone on HN (I don't remember the name) mentioned that the idea of deep learning is backed by big tech because it benefits them the most as they are the only players in town with huge amounts of data. If the AI community would find entirely different approaches to AGI (maybe not even learning), who do you think would suffer the most from the implications?

kgeist 2 years ago |

If Mamba selectively forgets "unnecessary" details, can it repeat the input verbatim (if asked)?

hackerlight 2 years ago | |

If you ask at the end of the prompt then it may have already deliberately tossed the information it deemed irrelevant prior to the question. These aren't transformers. In general the recall for arbitrary information will be worse.

kgeist 2 years ago | | |

So the questions should come before the content and it might work?

I think that's also how RWKV works.

alok-g 2 years ago |

A potentially naive question. Isn't this modeled like a Kalman Filter?

Edit: Sounds like it is. https://openreview.net/pdf?id=AL1fq05o7H

kken 2 years ago |

There is also this: https://jackcook.com/2024/02/23/mamba.html

fancyfredbot 2 years ago |

thecolorgreen 2 years ago |

Why doesn't Equation 1b use the h' defined in Equation 1a?

koayon 2 years ago | |

Hey! OP here. Great question - h' in Equation 1a refers to the derivative of h with respect to time (t). This is a differential equation which we can solve mathematically when we have x in order to get a closed-form solution for h. We would then plug that h (the hidden state) into Equation 1b.

In our case, we don't actually wait for a closed-form solution but instead compute the discrete representation (Equation 2).

Hope that helps!
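
As a rough numerical illustration of the step from the differential equation to the discrete update, the sketch below uses a scalar state, h'(t) = a·h(t) + b·x(t), and assumes x is held constant over each step of length Δ (zero-order hold). The function names are hypothetical. It compares the closed-form one-step update against a brute-force integration of the ODE.

```python
import math


def zoh_step(h, x, a, b, dt):
    """One discrete update h_t = a_bar * h_{t-1} + b_bar * x_t (scalar case).

    a_bar = exp(a * dt) and b_bar = (exp(a * dt) - 1) / a * b follow from
    solving h' = a*h + b*x exactly with x held constant for dt (a != 0).
    """
    a_bar = math.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar * h + b_bar * x


def euler_integrate(h, x, a, b, dt, substeps=10_000):
    """Integrate h' = a*h + b*x over dt with many tiny Euler steps."""
    step = dt / substeps
    for _ in range(substeps):
        h += step * (a * h + b * x)
    return h


if __name__ == "__main__":
    exact = zoh_step(1.0, 2.0, a=-0.5, b=1.0, dt=0.1)
    approx = euler_integrate(1.0, 2.0, a=-0.5, b=1.0, dt=0.1)
    print(exact, approx)  # the two values should agree closely
```

In this scalar setting the discrete update is just the ODE solved over one step, which is why h' never needs to appear explicitly once the discretized parameters are in hand.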

atlacatl_sv 2 years ago | |

I believe h' is for the next state. y(t) is to predict the next word so it uses the current hidden state h(t).

givemeethekeys 2 years ago |

Transformers, Rise of the Mambas, coming to a theater near you!

vanjajaja1 2 years ago |

I really enjoyed this article, thanks

AndrewKemendo 2 years ago |

Someone is going to re-invent Bellman's equations and call it Learnformer