To make the computations useful, they first trained a conventional digital neural network to predict the outputs given the controllable input parameters. Then they arbitrarily assigned some of the controllable parameters to be the inputs of the neural network and the rest to be the trainable weights. Then they used the crystal to run forward passes on the training data. After each forward pass, they used the trained regular neural network to do the backward pass and estimate the gradients of the outputs with respect to the weights. With those gradients they updated the weights just like in a regular neural net.
Although the gradients computed by the neural nets are not a perfect match to the real gradients of the physical system (which are unknown), they don't need to be perfect. Any drift is corrected because the forward pass is always run by the real physical system, and stochastic gradient descent is naturally pretty tolerant of noise and bias.
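A minimal sketch of that loop, assuming a toy setup that is not from the paper: `physical_forward` is a hypothetical stand-in for the hardware (a noisy tanh), and `surrogate_dy_dweights` is a deliberately wrong digital model used only for the backward pass. All names, sizes, and constants here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the hardware: a noisy nonlinear black box.
def physical_forward(x, weights):
    return np.tanh(x @ weights) + 0.01 * rng.standard_normal(x.shape[0])

# Deliberately imperfect digital model of the same system (assumed form).
# Only its derivative is used, for the backward pass.
def surrogate_dy_dweights(x, weights):
    pre = 0.8 * (x @ weights)  # wrong gain on purpose
    return (0.8 * (1.0 - np.tanh(pre) ** 2))[:, None] * x

x = rng.standard_normal((256, 4))
target = np.tanh(x @ np.array([1.0, -0.5, 0.3, 0.8]))  # toy targets
weights = np.zeros(4)
losses = []

for step in range(500):
    y = physical_forward(x, weights)  # forward pass on the "real" system
    err = y - target
    losses.append(0.5 * np.mean(err ** 2))
    # Backward pass: chain rule through the surrogate, not the real system.
    grad = (err[:, None] * surrogate_dy_dweights(x, weights)).mean(axis=0)
    weights -= 0.5 * grad  # plain gradient descent step

print(losses[0], losses[-1])
```

The surrogate's gradient is biased, but the error term always comes from the real forward pass, so in this toy case the loss still goes down.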
Since they're just using neural nets to estimate the behavior of the physical system rather than modeling it with physics, they can use literally any physical system and the behavior of the system does not have to be known. The only requirement of the system is that it does a complex nonlinear transformation on a bunch of controllable parameters to produce a bunch of outputs. They also demonstrate using vibrations of a metal plate.
Seems like this method may not lead to huge training speedups since regular neural nets are still involved. But after training, the physical system is all you need to run inference, and that part can be super efficient.
This is how ultra short pulses are made when the waves cancel out appropriately. Now I'm not sure if they are training a network to calculate the filter efficiently for even shorter pulses, or if the purpose is supposed to be an optical neural network, or why not both.
You used these words several times, and, considering the title "physical neural networks", I always wondered whether you mean regular as in real or as in artificial. If it's artificial, I'm not sure which one of them is "regular" -- LSTM, fully connected, transformers?
Any type of artificial neural net could be used. LSTM, transformer, convolutional, fully connected, whatever you want.
> using a differentiable digital model, the gradient of the loss is estimated with respect to the controllable parameters.
So e.g. they have a tunable laser that shifts the spectrum of an encoded input based on a set of parameters, and then they update the parameters based on a gradient computed from a digital simulation of the laser (physics aware model).
When I read the headline I imagined they had implemented back propagation in a physical system
> Here we introduce a hybrid in situ–in silico algorithm, called physics-aware training, that applies backpropagation to train controllable physical systems. Just as deep learning realizes computations with deep neural networks made from layers of mathematical functions, our approach allows us to train deep physical neural networks made from layers of controllable physical systems, even when the physical layers lack any mathematical isomorphism to conventional artificial neural network layers.
To my naive understanding, and please someone correct me if I'm wrong, the point is that they are not controlling the parameters that compute the NN forward pass directly (hence "no mathematical isomorphism to conventional NNs"), but "hyper-parameters" that guide the physical system to do so. For example, rotation angles of mirrors, or distance between filters, instead of intensity values of light. This leads to the non-linear transformations happening in situ, while simpler transformations in the backprop are still computed in-silico.
They touch on that by observing you could train a second physical neural network to compute the gradients for the first. So it could all be physical.
> Improvements to PAT could extend the utility of PNNs. For example, PAT’s backward pass could be replaced by a neural network that directly estimates parameter updates for the physical system. Implementing this ‘teacher’ neural network with a PNN would allow subsequent training to be performed without digital assistance.
So you need to use in silico training at first, but can get rid of it in deployment.
They make this claim first, and cite one source. I haven't heard of this as an issue before. Is there anywhere else I could read more on this?
They can be trained once and then frozen and you can develop new skills by learning control codes (prompts), or adding a retrieval subsystem (search engine in the loop).
If you shrink this foundation model to a single chip, something small and energy efficient, then you could have all sorts of smart AI on edge devices.
Training one of these big models takes 100 kWh for 1e19 flops. That's 360M Ws, i.e. 360 MJ or 3.6e8 J. 3.6e8 J / 1e19 flops = 3.6e-11 J/flop.
Neurons take 1e-8J/spike.[1]
Math check appreciated :)
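A quick check of that arithmetic, as a sketch that takes the figures exactly as stated in the comment (100 kWh, 1e19 flops, 1e-8 J/spike) without verifying them. The variable names are just for illustration.

```python
KWH_TO_J = 3.6e6  # 1 kWh = 1000 W * 3600 s = 3.6e6 J

train_energy_j = 100 * KWH_TO_J  # 100 kWh, as stated in the comment
train_flops = 1e19               # as stated in the comment
j_per_spike = 1e-8               # as stated in the comment

j_per_flop = train_energy_j / train_flops
flops_per_spike_energy = j_per_spike / j_per_flop

print(f"{train_energy_j:.2e} J total")             # 3.60e+08 J total
print(f"{j_per_flop:.2e} J/flop")                  # 3.60e-11 J/flop
print(f"{flops_per_spike_energy:.0f} flops/spike") # 278 flops/spike
```

So at these figures the energy of one spike buys roughly 280 flops.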
Does seem plausible to think of a single neuron spike (Hodgkin-Huxley cable model) being modeled with ~1k flops. Though I'm firmly of the opinion that nobody really knows how the brain works. The neural spike activity could be pure epiphenomenon. Who knows!
[1] “Finally, the energy supply to a neuron by ATP is 8.31 × 10^-9 J. Meanwhile, integrating the total power with respect to time we will get the consumed electric power, which is 8.75 × 10^-9 J. This is more energy than the ATP supplied. The energy efficiency is 105.3%. This is an anomaly…” - 2017 Feb 16 Wang, Xu, Institute for Cognitive Neurodynamics, East China University of Science and Technology https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5337805/
The actual limits on DL models (and any simulation or optimization) are power density and the speed of light, plus the maximum amount of power you can deliver to the area. The speed of light limits how long your cables can be while still doing collective reductions, and the power density limits how much compute power you can fit per unit volume. One could imagine a fully liquid cooled supercomputer at 100MW (located near a very reliable and large power source) with optical fiber interconnect; this would completely change the state of the art in large models overnight.
I cannot cite a source here, but it is generally believed that the effective GPU utilization in AI training clusters which are "100% utilized" is actually quite poor - 23%-26% - due to data movement, non-essential serial execution, and scheduling issues. So at least for now there is low-hanging fruit to improve the performance of the capital expenses.
Long term, though, DL clusters are basically CAPEX and energy limited.
IMHO, for now, return on the investment is not really a limiting factor, but it will become one once the shine is off the field.
As far as I know, we're closest to showing information processing in the visual cortex (which is highly linear) and we're still a long way from knowing how it works at a neural level. But maybe someone here can update on this?
But much of the cortex is highly recurrent (non-linear) and the idea that it's doing something like sending bits between synapses, encoded in spike timing or something.. well, I think that's highly speculative and has plenty of problems. But even if so, that's just "information processing".
I'm personally a fan of electromagnetic theories of consciousness[], where the synaptic activity could be an epiphenomenon of supporting a standing EM field.
[]https://en.wikipedia.org/wiki/Electromagnetic_theories_of_co...
I am not sure how much is known about information processing, but it's clear that motor impulses and sensory information are encoded in the spikes. Higher spike frequency = stronger signal. Synapses are how signals are passed from neuron to neuron.
See more: https://en.wikipedia.org/wiki/Neural_correlates_of_conscious...
Maybe it's more like epiphenomalis-m(a), where -is could be a genitive ending. I.e. the idea of epiphenomena.
Afaik, we have correlative descriptions of what these waves indicate (importantly, they're associated with sleep, consciousness, attentiveness), but no direct mechanical model of them or a clear purpose. So yeah, spike timing could still be used at this level, but it seems other behaviors are also happening that may be more essential to the larger function.
The sentiment that synapses probably don't explain everything is rather common, anyhow. For example, blood flow literally influences the relevant areas by transporting available energy, and neurotransmitters must be very important too. How those areas react to a shortage would explain why I become nasty when tired and hungry at the same time.