Neural Network Diffusion

223 points by vagabund 2 years ago | 86 comments

vessenes 2 years ago |

I wasn't sure if this paper was parody on reading the abstract. It's not parody. Two things stand out to me: the first is the idea of distilling these networks down into a smaller latent space, and then mucking around with that. That's interesting, and cuts across a bunch of interesting topics like interpretability, compression, training, over- and under-.. The second is that they show the diffusion models don't just converge on parameters similar to the ones they train against/diffuse into, and that's also interesting.

I confess I'm not sure what I'd do with this in the random grab bag of Deep Learning knowledge I have, but I think it's pretty fascinating. I might like to see a trained latent encoder that works well on a bunch of different neural networks; maybe that thing would be a good tool for interpreting / inspecting.

daxfohl 2 years ago | |

Seems like it could be useful for resizing the networks, no? Start with ChatGPT 4 then release an open version of it with far fewer parameters.

Or maybe some metaparameter that mucks with the sizes during training produces better results. Start large to get a baseline, then reduce size to increase coherence and learning speed, then scale up again once that is maxed out.

SubiculumCode 2 years ago | |

Perhaps the 10 similar but different versions of a model generated this way could then be fed into a mixture of experts?

vessenes 2 years ago | | |

Ooh that’s a good idea! Although Mixtral seems to have been seeded with identical copies of Mistral, so maybe it doesn’t buy you much? Sounds worth trying though!

daxfohl 2 years ago | | |

Or a good way to teleport out of local minima while training. Create a few clones and take the one with the steepest gradients.
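A rough sketch of the clone-and-pick idea above, under loud assumptions: a toy 1-D loss, Gaussian perturbations as the "clones" (rather than samples from a diffusion model), and hypothetical names `loss`, `grad`, and `pick_steepest`. The original point is kept as a candidate, so the chosen point is never flatter than the start.

```python
import math
import random

def loss(w):
    # Toy non-convex loss with several local minima (illustrative assumption).
    return math.sin(3.0 * w) + 0.1 * w * w

def grad(w):
    # Analytic derivative of the toy loss above.
    return 3.0 * math.cos(3.0 * w) + 0.2 * w

def pick_steepest(w, n_clones=8, sigma=0.5, seed=0):
    # Build perturbed clones of w (plus w itself) and keep the one
    # whose gradient has the largest magnitude.
    rng = random.Random(seed)
    candidates = [w] + [w + rng.gauss(0.0, sigma) for _ in range(n_clones)]
    return max(candidates, key=lambda c: abs(grad(c)))

w0 = -0.52  # close to a local minimum of the toy loss, so the gradient is small
best = pick_steepest(w0)
print(f"start:  loss={loss(w0):.3f} |grad|={abs(grad(w0)):.3f}")
print(f"chosen: loss={loss(best):.3f} |grad|={abs(grad(best)):.3f}")
```

Note that a steep gradient says nothing about whether the clone sits in a better basin; one would probably also want to compare losses after a few optimizer steps.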

namibj 2 years ago | |

Hmmm, I could think of using it to update a DDPM with a conditioning input as the dataset expands from an RL/online process, without ruining the conditioning mechanism that's only trainable through the actual RL itself.

I.e., self-supervised training is done to produce semantically sensical results, and the RL-trained conditioning input steers to contextually useful results.

(Btw., if anyone has tips on how to not wreck the RL training's effort when updating the base model with the recently encountered semantically-valid training samples that can be used self-supervised, please tell. I'd hate to throw away the RL effort expended to acquire that much training data for good self-supervised operation. It's already looking fairly expensive...)

daxfohl 2 years ago | |

You could use this and try to tease out something similar to https://news.ycombinator.com/item?id=39487124, but for NNs instead of images. Maybe it's possible to have this NN diffusion model explain the pieces of the NN they generate and why parameters have those values.

If we can get that, then maybe we don't even need to train anymore; it'd be possible to start to generate NNs algorithmically.

gwern 2 years ago |

This doesn't seem all that impressive when you compare it to earlier work like 'g.pt' https://arxiv.org/abs/2209.12892 Peebles et al 2022. They cite it in passing, but do no comparison or discussion, and to my eyes, g.pt is a lot more interesting (for example, you can prompt it for a variety of network properties like low vs high score, whereas this just generates unconditionally) and more thoroughly evaluated. The autoencoder here doesn't seem like it adds much.

vagabund 2 years ago |

Author thread: https://twitter.com/liuzhuang1234/status/1760195922502312197

squigz 2 years ago | |

Are there any sites for viewing Twitter threads without signing up?

f_devd 2 years ago | | |

https://nitter.esmailelbob.xyz/liuzhuang1234/status/17601959...

(bit of trial and error from https://github.com/zedeus/nitter/wiki/Instances)

falcor84 2 years ago |

Seems like we're getting very close to recursive self-improvement [0].

[0] https://www.lesswrong.com/tag/recursive-self-improvement

goggy_googy 2 years ago |

"We synthesize 100 novel parameters by feeding random noise into the latent diffusion model and the trained decoder." Cool that patterns exist at this level, but also, 100 params means we have a long way to go before this process is efficient enough to synthesize more modern-sized models.

Scene_Cast2 2 years ago |

Yay, an alternative to backprop & SGD! Really interesting and impressive finding, I was surprised that the network generalizes.

justanotherjoe 2 years ago |

fuck. I have an idea just like this one. I guess it's true that ideas are a dime a dozen. Diffusion bears a remarkable similarity to backpropagation to me. I thought that it could be used in place of it for some parts of a model.

Furthermore, I posit that resnet, especially in transformers, allows the model more exploratory behavior that is really powerful, and is a necessary component of the power of transformers. The transformer is just such a great architecture, the more I think about it. It's doing so many things so right. Although this is not really related to the topic.

crotchfire 2 years ago | |

Actually it is related.

Transformers are just networks that learn to program the weights of other networks [1]. In the successful cases the programmed network has been quite primitive -- merely a key-value store -- in order to ensure that you can backpropagate errors from the programmed network's outputs all the way to the programmer network's inputs.

The present work extends this idea to a different kind of programmed network: a convolutional image-processing network.

There are many more breakthroughs to be achieved along this line of research -- it is a rich vein to mine. I believe our best shot at getting neural networks to do discrete math and symbolic logic, and to write nontrivial computer programs, will result from this line of research.

[1] https://arxiv.org/abs/2102.11174
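To make the "programmed key-value store" view above concrete, here is a minimal sketch, assuming unnormalized linear attention with no softmax and no learned projections. The function names are made up for illustration. Writing each (key, value) pair into a weight matrix as a rank-1 update and then reading with a query gives the same result as summing values weighted by key-query dot products.

```python
def fast_weight_read(keys, values, query):
    # "Programmer" step: write each pair into W as a rank-1 update W += v k^T.
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
    # "Programmed network" step: a single linear layer applied to the query.
    return [sum(W[i][j] * query[j] for j in range(d_k)) for i in range(d_v)]

def linear_attention_read(keys, values, query):
    # Unnormalized linear attention: sum_i (k_i . q) * v_i.
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(k, query))
        out = [o + score * x for o, x in zip(out, v)]
    return out

# Orthogonal keys, so a query equal to a key retrieves exactly its value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
query = [1.0, 0.0]

print(fast_weight_read(keys, values, query))       # [2.0, 3.0, 4.0]
print(linear_attention_read(keys, values, query))  # [2.0, 3.0, 4.0]
```

Everything here is differentiable in the keys and values, which is the property the comment points to: errors at the read-out can flow back to whatever produced the writes.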

goggy_googy 2 years ago |

Important to note, they say "From these generated models, we select the one with the best performance on the training set." Definitely potential for bias here.
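A small simulation of why best-of-N selection can flatter a score, as a sketch under assumptions that are not from the paper: every generated model has the same true accuracy, and each is scored on the same-sized finite set. The name `simulate_best_of_n` and all the numbers are hypothetical.

```python
import random

def simulate_best_of_n(n_models=100, n_examples=200, true_acc=0.7, seed=0):
    # Every model is equally good (true_acc); measured scores differ only
    # because each is evaluated on a finite number of examples.
    rng = random.Random(seed)
    scores = []
    for _ in range(n_models):
        correct = sum(rng.random() < true_acc for _ in range(n_examples))
        scores.append(correct / n_examples)
    return max(scores), sum(scores) / len(scores)

best, mean = simulate_best_of_n()
print(f"best-of-100: {best:.3f}  mean: {mean:.3f}  true accuracy: 0.700")
```

The selected model's score on the selection set is optimistic by construction. How much of that carries over to a held-out test set depends on the evaluation protocol, which this toy does not model.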

nerdponx 2 years ago | |

I'd have liked to see the distribution of generated model performance.

QuadmasterXLII 2 years ago | | |

Fig 4b

marojejian 2 years ago |

Am I missing something, or is this just a case of "amortized inference", where you train a model (here a diffusion one) to infer something that was previously found via an optimization procedure (here NN parameters)?

jackblemming 2 years ago |

The state-of-the-art neural net architecture, whether that be transformers or the like, trained on self-play to optimize non-differentiable but highly efficient architectures is the way.

hackerlight 2 years ago | |

According to Hinton, before transformers were shown to work well, learning model architectures was Google's main focus.

hoc 2 years ago |

Hm, so does this actually improve/condense the representation for certain applications, or is this more some kind of global expand and collect in network space?

jarrell_mark 2 years ago |

Can this be used to fill in the missing information on the openworm nematode 302 neurons brain simulator?

amelius 2 years ago |

Why does Figure 7 not include a validation curve (afaict only the training curve is shown)?

nullc 2 years ago |

heh https://news.ycombinator.com/item?id=39208213#39211749

HanClinto 2 years ago | |

hah, nice! :D

t_serpico 2 years ago |

I'd wager that adding noise to the weights in a principled fashion would accomplish something similar to this.
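For anyone who wants to try the noise baseline suggested above, a minimal sketch, assuming a toy two-parameter linear model and isotropic Gaussian noise (which may not count as "principled"). `mse`, `noisy_copies`, and the data are made up for illustration.

```python
import random

def mse(w, data):
    # Mean squared error of the linear model y = w[0] * x + w[1].
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in data) / len(data)

def noisy_copies(w, n_copies, sigma, seed=0):
    # Generate "new" parameter sets by adding Gaussian noise to trained weights.
    rng = random.Random(seed)
    return [[wi + rng.gauss(0.0, sigma) for wi in w] for _ in range(n_copies)]

# Noise-free data from y = 2x + 1, so the trained weights fit it exactly.
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
trained = [2.0, 1.0]  # stands in for weights found by SGD

copies = noisy_copies(trained, n_copies=100, sigma=0.05)
losses = [mse(c, data) for c in copies]
print(f"trained loss: {mse(trained, data):.4f}")
print(f"noisy copies: min={min(losses):.4f} max={max(losses):.4f}")
```

On a real network one would compare the loss distribution of these noisy copies against the distribution of the diffusion-generated models, at a noise level that yields comparable diversity.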

jerpint 2 years ago | |

I would really be surprised if just adding noise would give you convergence