Author here. I had been fascinated with Andrej Karpathy's article (https://karpathy.github.io/2015/05/21/rnn-effectiveness/) -- especially where it shows neurons being activated in response to brackets and indentation.
I built Ecco to enable examining neurons inside Transformer-based language models.
You can use Ecco to simply interact with a language model and see its output token by token (as it's built on the awesome Hugging Face transformers package). But more interestingly, you can use it to examine neuron activations. The article explains more: https://jalammar.github.io/explaining-transformers/
I have a couple more visualizations I'd like to add in the future. It's open source, so feel free to help me improve it.
Thanks for your kind words! It's a labor of passion, honestly. And while in previous years it was a nights-and-weekends project, I have recently been giving it my entire time and focus -- which is why I'm able to dip my toes more heavily into R&D like Ecco and the "Explaining Transformers" article.
Having said that, IANAL, but I find it unlikely that the use of a dolphin and the word Ecco together is not trademarked, so you may want to check on that before someone bugs you about it.
I am curious about those recent O(L) attention transformers (see slide 106 of http://gabrielilharco.com/publications/EMNLP_2020_Tutorial__...). If these methods are converging towards a new self-attention mechanism, I'd love to try illustrating that.
What other attention modes are you referring to? Did something in particular catch your attention?
I tried looking at higher-level layers, and the groupings were indeed higher level: for example, at level 4 there was a grouping that highlighted any punctuation (and not just commas). The groupings were also more qualified: for example "would deliberately", whereas at lower levels it was just "would".
But it's not as clear as I had hoped it would be. I hoped it would somehow highlight groupings of larger and larger size that could map nicely to the equivalent of a parse tree.
The problem I have with this kind of visualization is that it often requires interpretation. Also, it doesn't tell me whether the structure was really present in the neural network but just not apparent because the prism of the Non-negative Matrix Factorization hid it.
For my own networks, instead of visualizing, I like to quantify things a little more. I give the neural network some additional layers, and I try to make the neural network produce the visualization directly. I give it some examples of what I'd like the visualization to look like, and jointly train/fine-tune the neural network so that it simultaneously solves its original task and produces the visualization, which is then easier to inspect.
Depending on how many additional layers I had to add, where they were added, and how accurate (measured by a loss function!) the network's predictions are, I can better infer how it's working internally, and whether the network is really doing the work or taking some mental shortcuts.
For example, in my Colorify [1] browser extension, which aims to reduce the cognitive load of reading, I use neural networks to simultaneously predict visualizations of sentence grouping, linguistic features, and even the parse tree.
[1] https://addons.mozilla.org/en-US/firefox/addon/colorify/
Are there theoretical reasons to choose NMF over other dimensionality reduction algorithms, e.g. UMAP?
Is it easy to add other DR algorithms? I may submit a PR adding those in if it is...
It should be easy, yeah. For NMF, the activations vector is reshaped from (layers, neurons, token position) down into (layers/neurons, token position), and we present that to sklearn's NMF model. I would assume UMAP would operate on that same matrix. That matrix is called 'merged_act' and is located here: https://github.com/jalammar/ecco/blob/1e957a4c1c9bd49c203993...
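A minimal sketch of that reshape-then-factorize step, not Ecco's actual code. The random array stands in for real captured activations, and the sizes, the number of factors, the NMF settings, and the choice to fit on the transposed matrix are all assumptions made for illustration. Only the name merged_act and the use of sklearn's NMF come from the comment above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for captured activations,
# shaped (layers, neurons, token positions).
rng = np.random.default_rng(0)
n_layers, n_neurons, n_tokens = 2, 16, 10
activations = rng.random((n_layers, n_neurons, n_tokens))

# Merge the layer and neuron axes: (layers * neurons, token positions).
merged_act = activations.reshape(n_layers * n_neurons, n_tokens)

# NMF requires non-negative input; clip in case activations can go negative.
merged_act = np.clip(merged_act, 0, None)

# Assumption: factorize the transpose so that each row is a token position.
n_factors = 4
model = NMF(n_components=n_factors, init="random", random_state=0, max_iter=500)
token_factors = model.fit_transform(merged_act.T)  # (tokens, factors)
neuron_loadings = model.components_                # (factors, layers * neurons)

print(token_factors.shape, neuron_loadings.shape)
```

Another dimensionality reduction method with a fit_transform interface could presumably be given the same merged_act matrix, though whether its output is as easy to read per token is a separate question.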
Scroll down to "Factorizing Activations of a Single Layer" in https://jalammar.github.io/explaining-transformers/ to see those.
The figure above it, titled 'Explorable: Ten Activation Factors of XML', shows neuron firing patterns in response to XML -- opening tags, closing tags, and even indentation.
It's still fresh, but I'm keen to see what other people uncover in their examinations (or what shortfalls/areas of improvement there are for such a method).
I do get your point on interpretation. This work is just a starting point. I'm curious to arrive at ways to automatically select the appropriate number of factors for a specific sequence. Kind of like the elbow method for K-means clustering.
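One hypothetical way to approach that, sketched under heavy assumptions: fit sklearn's NMF for several factor counts, record the reconstruction error of each fit, and look for the point where the error curve bends. The synthetic matrix, the candidate range, and the second-difference heuristic for locating the bend are all illustrative choices, not something Ecco does.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative "token x neuron" matrix built from 3 hidden factors.
rng = np.random.default_rng(0)
X = rng.random((12, 3)) @ rng.random((3, 64))

ks = list(range(1, 7))  # candidate numbers of factors
errors = []
for k in ks:
    model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=1000)
    model.fit(X)
    # Frobenius norm of (X - WH) for the fitted model.
    errors.append(model.reconstruction_err_)

# Crude elbow heuristic (an assumption): the k where the error curve
# bends most sharply, i.e. the largest second difference.
second_diff = np.diff(errors, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]

for k, err in zip(ks, errors):
    print(k, round(float(err), 4))
print("suggested number of factors:", elbow_k)
```

As with K-means, the bend is often ambiguous on real data, so this would at best suggest a starting point for manual inspection.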
https://arxiv.org/pdf/1703.03130.pdf
It's a bit older now but I was looking for a self attention method without resorting to a transformer model and this proposed an interesting implementation that wound up being very successful for my problem case.
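For readers who don't want to open the PDF, here is a minimal NumPy sketch of the attention step from the linked paper (A Structured Self-attentive Sentence Embedding), as it is commonly summarized: A = softmax(W_s2 tanh(W_s1 H^T)) and M = A H. The dimensions and the random matrices are placeholders. In the paper H comes from a BiLSTM and the weights are learned, so this only shows the shapes and the computation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, hidden, d_a, r = 7, 16, 8, 3  # illustrative sizes only

H = rng.standard_normal((n_tokens, hidden))  # stand-in for encoder outputs
W_s1 = rng.standard_normal((d_a, hidden))    # placeholder for learned weights
W_s2 = rng.standard_normal((r, d_a))         # r attention "hops"

# Each row of A is a distribution over token positions.
A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=-1)  # (r, n_tokens)

# Sentence embedding matrix: r weighted sums of the hidden states.
M = A @ H  # (r, hidden)

print(A.shape, M.shape)
```

Note there are no query/key projections between token pairs here: the cost is linear in sequence length, and each row of A can be inspected directly as a per-token weighting.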
[1]: https://twitter.com/Johannes_Welbl/status/106530965474036121...
Work and articles like yours have truly had an impact on me, even though they are largely qualitative. We always say “Turing complete” this and “Turing complete” that, but theoretical statements such as this have little practical utility to me, as we all know that what can be learnt and what is learnt are two very different things. For example, “Visualizing and Understanding Recurrent Networks” by Karpathy et al. (2015) [2], which you list as inspiration, blew my mind with, for example, neurons that monotonically decrease from the sentence start. I remember Karpathy giving a talk on it in London, and what struck me was how he had simply gone and inspected the neurons manually (heresy!), as there were only a few thousand of them anyway. That playfulness, truly admirable.
[2]: https://arxiv.org/abs/1506.02078
Another anecdote, now from “Attention Is All You Need” by Vaswani et al. (2017) [3] where I was far from sold on Transformers as a model until Uszkoreit gave a talk at an invitation-only summit where he showed those cherry-picked attention heads that “flipped” based on whether an object was animate or not. I approached him after the talk and asked why it was not in the paper as it was awesome! Maybe I am biased because I give a large role to intuition in science, but analysis such as this is far more valuable to me as a researcher than yet another point of BLEU or a 10th dataset. Again, my bias, but I feel that there is a need for new ways of thinking in terms of both “hard” empiricism and “soft” analysis in machine learning as we seemingly are now having to mature given the attention we are receiving.
[3]: https://arxiv.org/abs/1706.03762
Apologies if I am rambling, it is midnight now and I barely slept last night.
Thanks for digging up the screenshot. Exploring contextualized word embeddings is truly fascinating. And thanks for sharing your experience!