High-res image reconstruction with latent diffusion models from human brain (github.com)

459 points by trojan13 3 years ago | 155 comments

Aransentin 3 years ago |

I immediately found the results suspect, and think I have found what is actually going on. The dataset it was trained on was 2770 images, minus 982 of those used for validation. I posit that the system did not actually read any pictures from the brains, but simply overfitted all the training images into the network itself. For example, if one looks at a picture of a teddy bear, you'd get an overfitted picture of another teddy bear from the training dataset instead.

The best evidence for this is a picture(1) from page 6 of the paper. Look at the second row. The buildings generated by 'mind reading' subjects 2 and 4 look strikingly similar to each other, but not very similar to the ground truth! From manually combing through the training dataset, I found a picture of a building that does look like that, and by scaling it down and cropping it exactly in the middle, it overlays rather closely(2) on the output that was ostensibly generated for an unrelated image.

If so, at most they found that looking at similar subjects lights up similar regions of the brain; putting Stable Diffusion on top of it serves no purpose. At worst it's entirely cherry-picked coincidences.

1. https://i.imgur.com/ILCD2Mu.png

2. https://i.imgur.com/ftMlGq8.png

sillysaurusx 3 years ago | |

I don’t get the criticism here. Normally I’d be the first to err on the side of skepticism, but this work seems above board.

I think the confusion is that this model is generating “teddy bear” internally, not a photo of a teddy bear. I.e. the diffusion part was added for flair, not to generate the details of the images that exist inside your mind. They could just as easily have run print("teddy bear"), but they’re sending it to diffusion instead of printing it to console.

The fact that it can correctly discern between a dozen different outputs is pretty remarkable. And that’s all that this is showing. But that’s enough.

It’s not really a “gotcha” to say that it’s showing an image from the training set. They could have replaced diffusion with showing a static image of a teddy bear.

It sounds like this is many readers’ first time confronting the fact that scientists need to do these kinds of projects to get funding. As long as they’re not being intentionally deceptive, it seems fine. There’s a line between this and that ridiculous “rat brain flies plane” myth, and this seems above it.

Disclaimer: I should probably read the paper in detail before posting this, but the criticism of “the building looks like a training image” is mostly what I’m responding to. There are only so many topics one can think about, and having a machine draw a dog when I’m thinking about my dog Pip is some next-level sci-fi “we live in the future” stuff. Even if it doesn’t look like Pip, does it really matter?

Besides, it’s a matter of time till they correlate which parts of the brain are more prone to activating for specific details of the image you’re thinking about. Getting pose and color right would go a long way. So this is a resolution problem; we need more accurate brain sampling techniques, i.e. Neuralink. Then I’m sure diffusion will get a lot more of those details correct.

Aransentin 3 years ago | | |

Because pretty much everybody that reads the article will have taken away a grossly exaggerated idea of what the system is actually capable of. If Stable Diffusion was intentionally added "for flair" and really is unnecessary, then I would absolutely say that the researchers were being intentionally deceptive.

Even if we do a massive goalpost-move and grant that the system is only identifying the label "dog" with a brain scan of a person looking at a dog, we would need to see actual statistics of its labelling accuracy before judging it in that way. If the images in the paper are cherry-picked(1), it could easily be able to extract only a handful of bits, or no bits at all, and the entire thing could very well turn out to be replicable from random noise.

(1) Note that the paper even states "We generated five images for each test image and selected the generated images with highest PSMs [perceptual similarity metrics].", so it even directly admits that the presented images are cherry-picked at least once.
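The selection effect described in that footnote can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the similarity scores are drawn as pure Gaussian noise with no signal at all, `best_of_k_mean` is a hypothetical helper, and nothing here uses the paper's actual PSMs or data.

```python
import random
import statistics

def best_of_k_mean(k, n_test_images=982, seed=0):
    """Mean of the best score among k noise-only candidates per test image."""
    rng = random.Random(seed)
    best_scores = []
    for _ in range(n_test_images):
        # Assumption: every candidate's similarity score is pure noise, N(0, 1).
        candidates = [rng.gauss(0.0, 1.0) for _ in range(k)]
        best_scores.append(max(candidates))
    return statistics.mean(best_scores)

single = best_of_k_mean(k=1)        # no selection
best_of_five = best_of_k_mean(k=5)  # keep the best of five candidates
print(f"mean score, 1 candidate: {single:+.2f}")
print(f"mean score, best of 5:   {best_of_five:+.2f}")
```

Keeping the best of five lifts the average score by roughly one standard deviation even though no candidate carries any information, which is why accuracy statistics over unselected outputs matter.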

Hakkin 3 years ago | |

I'm definitely not an expert in this subject, but even if the model is overfitted, doesn't the fact that it can pull out the similar images at all give credit to the idea that a larger, non-overfitted model could actually work as the paper describes? It means that there does exist some correlation between the shown subject, the captured fMRI data, and the resulting location in latent space.

Double_a_92 3 years ago | | |

The output part is basically nonsense. It would be more honest if the output were text, e.g. "Teddybear" instead of a bad image of a random teddybear.

hiddencost 3 years ago | | |

Nope.

If you train a model where the input is an integer between 1 and 10, and the output is a specific image from a set of ten, the model will be able to get zero loss on the task. That is what's happening here.
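A minimal sketch of that memorization argument, assuming the integer is one-hot encoded and the "images" are random 64-pixel vectors (both are illustrative choices, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_pixels = 10, 64                 # toy sizes (assumption)
images = rng.random((n_items, n_pixels))   # stand-ins for ten training images
inputs = np.eye(n_items)                   # integer 1..10 encoded as a one-hot row

# Least-squares fit of a linear map from the one-hot input to image pixels.
weights, *_ = np.linalg.lstsq(inputs, images, rcond=None)

predictions = inputs @ weights
train_loss = float(np.mean((predictions - images) ** 2))
print(f"training loss: {train_loss:.2e}")
```

The fitted weights are simply the ten images stored row by row, so the training loss is zero up to floating-point error while the model has learned nothing that generalizes.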

thedudeabides5 3 years ago | | |

Yes.

It means there may be signal in the noise. Even if it's overfitting. Which makes sense.

A sufficiently granular map of the human brain ought to be readable, if you know what the input and output signals are.

chaxor 3 years ago | | |

If things are being overfit you should typically make the model smaller - not larger.

kdma 3 years ago | |

Good find, when I read it I called bullshit but I got lost trying to understand the diagrams. Another gotcha is the semantic decoder, they are just looping the model on itself "A cozy teddy bear" + fMRI random input => A teddy bear!!!

arnarbi 3 years ago | |

Subject 4 in the first row also looks very different from the ground truth, but is clearly an airliner. I'm curious if there is also a closer match to that one in the set.

brucethemoose2 3 years ago | |

It's still picking out the correct "overfitted" images, which is remarkable.

Theoretically, the results would scale to more training images... we just need to fMRI all of LAION-5B. Easy peasy.

mkagenius 3 years ago | | |

The only question is whether more images will confuse the model or not?

sampo 3 years ago | |

> The dataset it was trained on was 2770 images, minus 982 of those used for validation.

I don't think you got that 2770 correct. Might be 9250 images, minus 982 (that one you got right). Then again, the paper is so badly written, I find it difficult to decipher what they did. From section 3.1:

> Briefly, NSD provides data acquired from a 7-Tesla fMRI scanner over 30–40 sessions during which each subject viewed three repetitions of 10,000 images. We analyzed data for four of the eight subjects who completed all imaging sessions (subj01, subj02, subj05, and subj07).

> We used 27,750 trials from NSD for each subject (2,250 trials out of the total 30,000 trials were not publicly released by NSD). For a subset of those trials (N=2,770 trials), 982 images were viewed by all four subjects. Those trials were used as the test dataset, while the remaining trials (N=24,980) were used as the training dataset.

https://www.biorxiv.org/content/10.1101/2022.11.18.517004v2....
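The numbers in that passage can be cross-checked with simple arithmetic. The division by three assumes every released image was shown exactly three times, which the quote does not state, so the image counts below are rough estimates only.

```python
total_trials = 3 * 10_000                # three repetitions of 10,000 images
unreleased_trials = 2_250                # not publicly released by NSD
used_trials = total_trials - unreleased_trials

test_trials = 2_770                      # trials for the 982 shared images
test_images = 982
train_trials = used_trials - test_trials

# Assumption: three trials per image, so trials / 3 approximates image count.
approx_images = used_trials // 3
approx_train_images = approx_images - test_images

print("used trials:", used_trials)
print("training trials:", train_trials)
print("approx. distinct images:", approx_images)
print("approx. training images:", approx_train_images)
```

Under that assumption the 2,770 figure counts test trials, not images, and the training set covers on the order of 8,000 distinct images per subject.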

2-718-281-828 3 years ago | |

there is also no way that you could represent details as shown with such a small sample.

bawolff 3 years ago | |

Even if true, the result still seems very impressive to me as a layman.

gfaure 3 years ago | | |

That’s the whole problem — that the reconstruction aspect of the contributions seems overstated given only a layperson’s understanding.

SubiculumCode 3 years ago | |

I feel like you might be moving the goal posts here a bit. Getting a reconstruction that is a bear, even if not the same bear, is impressive enough to be noteworthy.

xmonkee 3 years ago | | |

I think the point is that it's not a reconstruction. It's more like recognizing which letter of a thousand-letter alphabet is shown to the human after decoding their brain waves. Still impressive, but not really as impressive as visual reconstruction.

razor_router 3 years ago | |

What evidence do you have that this technique is overfitting the training data rather than reading the brain?

ditchfieldcaleb 3 years ago | |

What are you talking about? They didn't train a model for this. That's why it's so impressive.

Hakkin 3 years ago | | |

Quoting from the paper,

  The only training required in our method is to construct
  linear models that map fMRI signals to each LDM
  component, and no training or fine-tuning of deep-learning
  models is needed.

  ...

  To construct models from fMRI to the components of
  LDM, we used L2-regularized linear regression, and all
  models were built on a per subject basis. Weights were
  estimated from training data, and regularization parameters
  were explored during the training using 5-fold
  cross-validation.
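For readers unfamiliar with the technique named in the quote, here is a rough sketch of L2-regularized (ridge) linear regression with 5-fold cross-validation over the regularization strength. It is not the authors' code: the data are synthetic, the array sizes and the `alphas` grid are made up, and `ridge_fit` and `cv_error` are hypothetical helper names.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge solution: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def cv_error(X, Y, alpha, n_folds=5):
    """Mean squared validation error averaged over n_folds contiguous folds."""
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(X.shape[0], dtype=bool)
        train_mask[val_idx] = False
        W = ridge_fit(X[train_mask], Y[train_mask], alpha)
        errors.append(np.mean((X[val_idx] @ W - Y[val_idx]) ** 2))
    return float(np.mean(errors))

# Synthetic stand-ins (assumptions): 200 "trials", 50 "voxels", 8 latent dims.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
true_W = rng.standard_normal((50, 8))
Y = X @ true_W + 0.5 * rng.standard_normal((200, 8))

alphas = [0.01, 1.0, 100.0, 10000.0]     # hypothetical search grid
scores = {alpha: cv_error(X, Y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
W = ridge_fit(X, Y, best_alpha)          # refit on all training data
print("best alpha:", best_alpha, "weights shape:", W.shape)
```

The only learned parameters are the weight matrix `W` per subject; in the setup the quote describes, the latent diffusion model itself stays frozen.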

2bitencryption 3 years ago |

Are any of the example images novel, i.e. new to the model? Or is the model only reconstructing images it has already seen before?

Either way, if I'm understanding right, it's very impressive. If the only input to the model (after training) is a fMRI reading, and from that it can reconstruct an image, at the very least that shows it can strongly correlate brain patterns back to the original image.

It'd be even cooler (and scarier?) if it works for novel images. I wonder what the output would look like for an image the model had never seen before? Would a person looking at a clock produce a roughly clock-like image, or would it be noise?

All the usual skepticism to these models applies, of course. They are very good at hallucinating, and we are very good at applying our own meaning to their hallucinations.

andai 3 years ago | |

There was a video many years ago (early 2010s?) demoing a similar technology, which would overlay and blend many images on top of each other to make a fuzzy image approximating what was actually being viewed.

Edit: found it! https://youtu.be/nsjDnYxJ0bo

ricudis 3 years ago | | |

The YouTube video cites a paper by the same author, so it's probably the same group's work. I wonder why they didn't use an approach similar to the one in the video using SD - it looks more viable.

crispyambulance 3 years ago |

In 1990, there was a train-wreck Wim Wenders movie that I loved and still love called "Until the End of the World". It was about a scientist (played by Max Von Sydow) who created a machine that could record someone's dreams or visual experiences directly from the brain and play them back even to a blind person. https://youtu.be/gilzgbdk300?t=442

Anyways, the images that were depicted in this work of fiction shot in 1990 about "the future" of 2000, had a very interesting look to them-- kind of distorted and dreamy like the images in the paper.

Are the images in the paper just a case of overfitting? ¯\_(ツ)_/¯ but it still makes me giddy remembering the Wim Wenders film.

dustractor 3 years ago | |

Such a great soundtrack too! I rewatched it last week just for the jams. Also, for those into the glitch art genre, the dream sequences were WAY ahead of their time.

donohoe 3 years ago |

As people and groups increasingly move in this direction, do we think about vectors for abuse in 10, 20 or 50+ years?

The human mind is considered the only place where we have true privacy. All these efforts are taking that away.

At this rate all notions of privacy will soon be dead.

gus_massa 3 years ago |

In case someone missed it, there is a link to more info https://sites.google.com/view/stablediffusion-with-brain/ and to the preprint https://www.biorxiv.org/content/10.1101/2022.11.18.517004v2

ninesnines 3 years ago |

I am suspicious of these results; if we blast a high frequency visual stimulus of a couple of letters and do quite a lot of post processing we can sometimes get a visual cortex map of those particular letters. However, these paper examples are very complex images and I’m very doubtful of the results - aransentin above made a couple of very valid points

SubiculumCode 3 years ago | |

I mean in 2011, they were already able to do some basic reconstruction: see https://www.youtube.com/watch?v=nsjDnYxJ0bo

I could imagine improvements since then, especially with advances in image networks.

Lutzb 3 years ago |

Reminds me of this paper [1] from 2011. See it in action in [2]

1. https://www.cell.com/current-biology/fulltext/S0960-9822(11)...

2. https://www.youtube.com/watch?v=nsjDnYxJ0bo

Edit: Just realized the paper above is also from Shinji Nishimoto

smusamashah 3 years ago |

There was this research where they reconstructed human face images from monkey brain scans. https://www.electronicproducts.com/scientists-reconstruct-im...

What's astonishing here is the quality of reconstruction. But I have not seen this research referenced a lot. Does someone know how/why the reconstruction from monkey brain looks so perfect while we don't have anything close from human brain?

Edit: better images here https://www.newscientist.com/article/2133343-photos-of-human...

drzoltar 3 years ago |

My understanding is that we won’t get a “mind reader” model out of this, because visual stimulus vs your imagination happen in separate parts of the brain. In other words we won’t be reading the minds of suspected criminals anytime soon. Maybe someone with neurology experience can chime in here? Is it even theoretically possible to see what’s happening in the imagination?

wongarsu 3 years ago | |

In the best (worst?) case the method generalizes well, and you could just replace the training set of fMRI scans of people viewing images with fMRI scans of people asked to recall images they were shown previously, or fMRI scans of people told to imagine a scene based on a verbal description. It's rarely that easy though

giantrobot 3 years ago | |

> In other words we won’t be reading the minds of suspected criminals anytime soon.

Oh don't worry, this will get wrapped up in some pseudoscience bullshit and misleading statistics and marketed to law enforcement. But not to worry, at first it'll only be used on real bad criminals. If you have nothing to hide you have nothing to fear!

meindnoch 3 years ago | |

I have a hunch that any sort of mind-reading machine would have to be tailored uniquely to the individual you want to probe. The internal neural representations likely develop uniquely for each individual.

soulofmischief 3 years ago | | |

We'll just train a model on training models.

rvnx 3 years ago |

Creepy and cool at the same time. It goes into the bucket of things that are not ethically right, the same way as implanting chips to read monkeys' brains. But technically interesting and well-executed.

00F_ 3 years ago |

here we see, basically, a potential feedback loop. AI tools advance brain science -- more advanced brain science can then inform progress in AI. this is why the situation is dangerous: because people dont think about these feedback loops. people see AI and they move the goalposts and rationalize by saying that "cutting edge AI is still short of AGI so its ok." but most normal people dont think about how AI can be used to create AI or how AI could be used to revolutionize all kinds of fields that then plug back into AI. this is a very dangerous, non-linear space. its not the first non-linear space we have traversed but its certainly the least linear space we have ever entered into and it is the highest stakes humanity has ever or will ever deal with.

even if this is just another bullshit article, im just making a point related to it. people need to be worried about this. for the first time in history, lots of people are now creeped out by AI. but they arent taking action or demanding change. we need regulation, grass-roots efforts to stop AI. even if the only way humanity could abort AI as a concept, or delay it for a significant amount of time, was to return to the iron age, and it certainly isnt the only way, it would be unambiguously worth it, in every way and from every angle.

AI requires large compute. what we are doing now was impossible just 20 years ago. if not 20 then 30. you cant manufacture that kind of compute in your garage. global regulation would take care of it no problem. at the very least it would buy us an enormous amount of time that we could use to figure something else out. people always say that some hold-out country would defy global regulations. they wouldnt defy NATO, let alone a super-global coalition. and the idea of such a group or NATO enforcing compute regulations is not far-fetched whatsoever because the emergence of AGI or even advanced non-AGI goes against the interests of literally every human being. there is no group of humans that benefit from that ultimately. the problem is simply waking people up to this plain fact.

politician 3 years ago |

Show HN: Human Diffusion

Hi everybody! We’re Joe and Ahmed and super thrilled to be launching Human Diffusion today! We’ve built an exciting new image generation system that supports economies in developing nations.

Our product leverages the latent creativity of humanity by directly fitting employees with fMRI rigs and presenting them with text inquiries through our API (JavaScript SDK available, Python soon!). Unlike competing alternatives we preserve human jobs in an era of AI supremacy.

I’d like to address rumors that our facilities amount to slaving brains to machines. This is a gross misunderstanding of the benefits we offer to our staff - they are family. Our 18 hour shifts are finely calibrated based on feedback collected through our API, and any suggestion of exploitation is flatly untrue.

Send us an email (satire@humandiffusion.com) to get early access.

Madmallard 3 years ago |

Couldn't we train an AI on fMRI or EEG data, with like billions of samples of people thinking and describing what they're thinking about, and have it gradually reach some level of accuracy?

samuelzxu 3 years ago |

There's also this paper with very similar methodology called Mind-Vis, and also accepted to CVPR 2023. https://mind-vis.github.io/

rvz 3 years ago |

Another small step into creating a worse dystopia than the one we are already living in.

Please continue. /s Governments, three letter agencies and the like would be absolutely excited to see this. The future that no-one has asked for.

exclipy 3 years ago |

In 2004, I wrote a short story about exactly this in high school. Using neural networks to "mind read" visual images from an fMRI scan of a brain. I thought it was farfetched, but look where we are now!

dheera 3 years ago |

I wonder how well this would work with wearable brainwave detectors rather than MRI, seeing as MRI isn't really something I could have at home.

babblingfish 3 years ago | |

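By brainwave detector I am going to assume you mean an EEG. An EEG measures electrical activity from electrodes on the scalp, which mostly reflects activity near the surface of the brain. An fMRI measures blood-oxygenation changes across the entire brain in voxels on the order of millimeters, with a lag of several seconds, rather than the activity of individual neurons in real time. It's sort of an apples to oranges comparison given the tools measure different things at vastly different resolutions.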

jejeyyy77 3 years ago | |

any idea what the best consumer brainwave detectors are on the market rn?

oth001 3 years ago | | |

OpenBCI is one. Not cheap from what I've seen.

mrtranscendence 3 years ago | | |

Need answer fast?

bitL 3 years ago |

Can't wait for this to become one of the individual performance metrics, recording all brain states all the time (video/audio/etc.), and part of regular performance reviews...

ACV001 3 years ago |

This is a big thing. Although this particular paper is not a big thing, the many related studies it cites set a trend.

fretime 3 years ago |

I'm looking forward to it. When will the code be released? Thanks.

chrstphrknwtn 3 years ago |

I don't see anything "high-res" about the reconstructed images.

_448 3 years ago |

So this is like mind reading?

lazy_moderator1 3 years ago |

not the first time something like this ended up on HN

https://news.ycombinator.com/item?id=33632337

convolvatron 3 years ago |

very curious about the little 'semantic model' at the bottom of the brain. does anyone know how that gets constructed and how it gets fed into the results?