Vision language models are blind

Vision language models are blind (vlmsareblind.github.io)

451 points by taesiri 1 year ago | 191 comments

Entertaining, but I think the conclusion is way off.

> their vision is, at best, like that of a person with myopia seeing fine details as blurry

is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.

Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming that they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks and the authors have not presented any real evidence about these two modes.

[1] https://arxiv.org/abs/2404.04125

SomaticPirate 1 year ago | |

There are quite a few "ai apologists" in the comments but I think the title is fair when these models are marketed towards low vision people ("Be my eyes" https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent to human vision. These models are implied to be human level equivalents when they are not.

This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.

benreesman 1 year ago | | |

If we're throwing "citation needed" tags on stuff, how about the first sentence?

"Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro are powering countless image-text processing applications"

I don't know how many a "countless" is, but I think we've gotten really sloppy in terms of what counts for LLMs as a demonstrated, durable win in a concrete task attached to well-measured outcomes and holding up over even modest periods of time.

This stuff is really promising and lots of builders are making lots of nifty things, so if that counts as an application then maybe we're at countless, but in the enterprise and in government and in refereed academic literature we seem to be at the proof-of-concept phase. Impressive chat bots as a use case are pretty dialed in, enough people claim that they help with coding that I tend to believe it's a real thing (I never seem to come out ahead of going directly to the source, StackOverflow).

The amount of breathless press on this seems "countless", so maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it.

pixl97 1 year ago | | |

Yeah, really, if you look at human learning/seeing/acting there is a feedback loop that an LLM, for example, isn't able to complete and train on.

You see an object. First you have to learn how to control all your body functions to move toward it and grasp it. This teaches you about the 3 dimensional world and things like gravity. You may not know the terms, but it is baked in your learning model. After you get an object you start building a classification list "hot", "sharp", "soft and fuzzy", "tasty", "slick". Your learning model builds up a list of properties of objects and "expected" properties of objects.

Once you have this 'database' you create as a human, you can apply the logic to achieve tasks. "Walk 10 feet forward, but avoid the sharp glass just to the left". You have to have spatial awareness, object awareness, and prediction ability.

Models 'kind of' have this, but it's seemingly haphazard, kind of like a child that doesn't know how to put all the pieces together yet. I think a lot of embodied robot testing where the embodied model feeds back training to the LLM/vision model will have to occur before this is even somewhat close to reliable.

sweezyjeezy 1 year ago | | |

The paper I linked should hopefully mark me out as far from an AI apologist, it's actually really bad news for GenAI if correct. All I mean to say is the clickbait conclusion and the evidence do not match up.

lynx23 1 year ago | | |

Be My Eyes user here. I disagree with your uninformed opinion. Be My Eyes is more often than not more useful than a human. And I am reporting from personal experience. What experience do you have?

kenjackson 1 year ago | | |

Simple is a relative statement. There are vision problems where monkeys are far better than humans. Some may look at human vision and memory and think that we lack basic skills.

With AI we are creating intelligence but with different strengths and weaknesses. I think we will continue to be surprised at how well they work on some problems and how poorly they do at some “simple” ones.

brookst 1 year ago | | |

I don’t see Be My Eyes or other similar efforts as “implied” to be equivalent to humans at all. They’re just new tools which can be very useful for some people.

“These new tools aren’t perfect” is the dog bites man story of technology. It’s certainly true, but it’s no different than GPS (“family drives car off cliff because GPS said to”).

Lerc 1 year ago | | |

I disagree. I think the title, abstract, and conclusion not only misrepresent the state of the models but also misrepresent their own findings.

They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.

I don't think that models have been portrayed as equivalent to humans. Like most AI, they have been shown to be vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portrayals by the AI field in general.

Melomololotolo 1 year ago | | |

Ah yes the blind person who constantly needs to know if two lines intersect.

Let's just ignore what a blind person normally needs to know.

You know what blind people ask? Sometimes their daily routine is broken because there is some type of construction and models can tell you this.

Sometimes they need to read a basic sign and models can do this.

Those models help people already and they will continue to get better.

I'm not sure if I'm more frustrated by how condescending the authors are or by your ignorance.

Valid criticism doesn't need to be shitty

TeMPOraL 1 year ago | |

Entertaining is indeed the right word. Nice job identifying corner cases of models' visual processing; curiously, they're not far conceptually from some optical illusions that reliably trip humans up. But to call the models "blind" or imply their low performance in general? That's trivially invalidated by just taking your phone out and feeding a photo to the ChatGPT app.

Like, seriously. One poster below whines about "AI apologists" and BeMyEyes, but again, it's all trivially testable with your phone and a $20/month subscription. It works spectacularly well on real world tasks. Not perfectly, sure, but good enough to be useful in practice and better than alternatives (which often don't exist).

FrenchDevRemote 1 year ago | |

> their vision is, at best, like that of a person with myopia seeing fine details as blurry

It's not that far from reality: most models see images in very low resolution and with limited colors, so that description isn't far off.

vikramkr 1 year ago | | |

They didn't test that claim at all though. Vision isn't some sort of 1D sliding scale with every vision condition lying along one axis.

First of all myopia isn't 'seeing fine details as blurry' - it's nearsightedness - and whatever else this post tested it definitely didn't test depth perception.

And second - inability to see fine details is a distinct/different thing from not being able to count intersections and the other things tested here. That hypothesis, if valid, would imply that improving the resolution of the image that the model can process would improve its performance on these tasks even if reasoning abilities were the same. That - does not make sense. Plenty of the details in these images that these models are tripping up on are perfectly distinguishable at low resolutions. Counting rows and columns of blank grids is not going to improve with more resolution.

I mean, I'd argue that the phrasing of the hypothesis ("At best, like that of a person with myopia") doesn't make sense at all. I don't think a person with myopia would have any trouble with these tasks if you zoomed into the relevant area, or held the image close. I have a very strong feeling that these models would continue to suffer on these tasks if you zoomed in. Nearsighted != unable to count squares.

blackmesaind 1 year ago | | |

My thoughts as well. I too would have trouble with the overlapping lines tests if all the images underwent convolution.

csomar 1 year ago | |

> these huge GenAI models are pretty good at things

Is this the sales pitch though? Because 15 years ago, I had a scanner with an app that could scan a text document and produce the text on Windows. The machine had something like 256 MB of RAM.

Tech can be extremely good at niches in isolation. You could have an OCR system 10 years ago and it'd be extremely reliable at the single task it was configured to do.

AI is supposed to bring a new paradigm, where the tech is not limited to the specific niche the developers have scoped it to. However, if it reliably fails to detect simple things a regular person should not get wrong, then the whole value proposition is kicked out of the window.

jrflowers 1 year ago | |

>I could (well actually I can't)

I like the idea that these models are so good at some sort of specific and secret bit of visual processing that things like “counting shapes” and “beating a coin toss for accuracy” shouldn’t be considered when evaluating them.

valine 1 year ago | | |

LLMs are bad at counting things just in general. It’s hard to say whether the failures here are vision based or just an inherent weakness of the language model.

vikramkr 1 year ago | | |

Those don't really have anything to do with fine detail/nearsightedness. What they measured is valid/interesting - what they concluded is unrelated.

godelski 1 year ago | |

> Did they try to probe that hypothesis at all?

I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis. Sure, they could have used a better analogy and they could have done other tests, but the paper still counters quite common claims (from researchers) about VLMs.

> I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.

I find it hard to believe that there is no example you can give. It surely doesn't have to be exactly your training data. If it is this good, surely you can create an example no problem. If you just don't want to, that's okay, but then don't say it.

But I have further questions. Do you have complicated prompting? Or any prompt engineering? It sure does matter how robust these models are to prompting. There's a huge difference between a model being able to accomplish a task and a model being able to perform a task outside a very specific environment. This is no different than something working in a tech demo and not in the hands of the user.

> But in practice, we aren't just making up tasks to trip up these models.

I see this sentiment quite often and it is baffling to me.

First off, these tasks are not clearly designed to trick these models. A model failing at a task is not suddenly "designed to trick a model." It's common with the river crossing puzzles where they're rewritten to be like "all animals can fit in the boat." If that is "designed to trick a model", then the model must be a stochastic parrot and not a generalist. It is very important that we test things where we do know the answer because, unfortunately, we're not clairvoyant and can't test questions we don't know the answer to. Which is the common case in real-world usage.

Second, so what if a test was designed to trick up a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly? This seems doubly important if they are tasks that humans don't find challenging.

> They can be very performant on some tasks and the authors have not presented any real evidence about these two modes.

I don't think people are claiming that large models can't be performant on some tasks. If they are, they're rejecting trivially verifiable reality. But not every criticism has to also contain positive points. There's plenty of papers and a lot of hype already doing that. And if we're going to be critical of anything, shouldn't it be that the companies creating these models -- selling them, and even charging researchers to perform these types of experiments that can be and are used to improve their products -- should be much more clear about the limitations of their models? If we need balance, then I think there's bigger fish to fry than Auburn and Alberta Universities.

ClumsyPilot 1 year ago | | |

> Second, so what if a test was designed to trick up a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly?

People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.

The real world outside is designed to trip up the model. Strange things happen all the time.

Because software developers have no governing body, no oaths of ethics and no spine, someone will end up dead in a ditch from malfunctioning AI.

orbital-decay 1 year ago | | |

> I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis.

I don't know, words have meanings. If that's a communication issue, it's on the part of the authors. To me, this wording in what is supposed to be a research paper abstract clearly suggests insufficient resolution as the cause. How else should I interpret it?

> The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry

And indeed, increasing the resolution is expensive, and the best VLMs have something like 1000x1000. But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.

>I find it hard to believe that there is no example you can give.

I'm not the person you're replying to, but I actually lazily tried two of the authors' examples in a less performant VLM (CogVLM), and was surprised it passed those, making me wonder whether I can trust their conclusions until I reproduce their results. LLMs and VLMs have all kinds of weird failure modes, it's not a secret they fail at some trivial tasks and their behavior is still not well understood. But working with these models and narrowing it down is notoriously like trying to nail a jelly to the wall. If I was able to do this in a cursory check, what else is there? More than one research paper in this area is wrong from the start.

itkovian_ 1 year ago | |

I think gpt4o is probably doing some ocr as preprocessing. It's not really controversial to say the VLMs today don't pick up fine grained details - we all know this. Can just look at the output of a vae to know this is true.

thomasahle 1 year ago | | |

If so, it's better than any other ocr on the market.

I think they just train it on a bunch of text.

Maybe counting squares in a grid was probably not considered important enough to train for.

_flux 1 year ago | | |

Why do you think it's probable? The much smaller llava that I can run on my consumer GPU can also do "OCR", yet I don't believe anyone has hidden any OCR engine inside llama.cpp.

kalkin 1 year ago | |

There's definitely something interesting to be learned from the examples here - it's valuable work in that sense - but "VLMs are blind" isn't it. That's just clickbait.

diwank 1 year ago | |

Yeah I think their findings are def interesting but the title and the strong claims are a tad hyperbolic.

jetrink 1 year ago |

I had a remarkable experience with GPT-4o yesterday. Our garage door started to fall down recently, so I inspected it and found that our landlord had installed the wire rope clips incorrectly, leading to the torsion cables losing tension. I didn't know what that piece of hardware was called, so I asked ChatGPT and it identified the part as I expected it to. As a test, I asked if there was anything notable about the photo. ChatGPT correctly identified that the cables were installed backwards, with the side of the cable that was (previously) under tension on top of the slack end, instead of sandwiched securely in the middle. To diagnose that requires tracing the cable through space and inferring which end is under tension from the geometry, though I can't rule out an educated guess.

What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation.

Screenshot: https://imgur.com/a/QqCNzOM

sfink 1 year ago | |

A human would need to trace the cable. An LLM may just be responding based on (1) the fact that you're asking about the clip in the first place, and that commonly happens when there's something wrong; and (2) that this is a very common failure mode. This is supported by it bringing up the "never saddle a dead horse" mnemonic, which suggests the issue is common.

After you fix it, you should try asking the same questions!

fn-mote 1 year ago | |

As a human, I was unable to see enough in that picture to infer which side was supposed to be under tension. I’m not trained, but I know what I expected to see from your description.

Like my sister post, I’m skeptical that the LLM didn’t just get lucky.

MagicMoonlight 1 year ago | |

To trace it through space it would need short term memory and the ability to think. It does not have it. It must therefore be guessing.

JeremyHerrman 1 year ago |

VLMs so far have never been good at counting objects or spatial relationships (e.g. the coffee is to the right of the microwave).

There are ways to help the VLM out - Set of Marks [0] from Microsoft being the most prominent, which uses segmentation to outline and label sections of the image before sending to the VLM.

Providing "speakable" labels to regions helps ground the visual abilities of VLMs and is why in this paper the performance is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid"

0: https://github.com/microsoft/SoM

joelburget 1 year ago |

Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3x 8 bit channels), this implies that they've compressed 256 * 24 = 6144 bits into 13 bits (log2(8192) = 13). [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
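
The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope estimate: the 512 x 512, 1024-token and 8192-entry figures are the ones quoted from the Chameleon paper, while the 24 bits per pixel is the commenter's assumption, and raw pixel bits are an upper bound on the information in an image rather than a measure of it.

```python
import math

# Figures quoted from the Chameleon paper in the comment above.
image_side = 512        # image is 512 x 512 pixels
num_tokens = 1024       # discrete tokens produced per image
codebook_size = 8192    # number of entries in the tokenizer codebook

# Assumption made in the comment: 3 channels x 8 bits per pixel.
bits_per_pixel = 24

pixels_per_token = image_side * image_side // num_tokens
raw_bits_per_token = pixels_per_token * bits_per_pixel
coded_bits_per_token = math.log2(codebook_size)
compression_ratio = raw_bits_per_token / coded_bits_per_token

print(pixels_per_token)          # 256
print(raw_bits_per_token)        # 6144
print(coded_bits_per_token)      # 13.0
print(round(compression_ratio))  # 473
```

So under these assumptions each token stands in for roughly 470 times more raw bits than it can carry.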

ec109685 1 year ago | |

It’s not as simple as that. If you ask GPT-4o to create a copy of these images, it generally creates one faithfully (e.g. an image with 5 squares will be produced), so it’s “seeing” things reasonably enough.

It doesn’t seem to have the logic though to answer these questions.

The complete data set is here to play around with it yourself: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...

energy123 1 year ago | |

GPT-4o is very good at some visual tasks like optical character recognition. So the selective blindness might just be what you say here -- all of its capacity is dedicated to minimizing loss on a few narrow tasks that had the most training data (like OCR). So it's not necessarily an inherent failure of the architecture to generalize, it could just be a capacity issue that will naturally be resolved with more scale.

sushid 1 year ago | | |

Is that not just traditional OCR applied on top of an LLM?

buryat 1 year ago | |

For some reason this got me thinking about trying to describe the taste of a fruit to someone who hasn't tried it, which seems like a similar problem in a non-visual sensory modality in humans.

cs702 1 year ago |

Wow, that is embarrassingly bad performance for current SOTA models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5), which are advertised and sold as being able to understand images, e.g., for guiding the blind or tutoring children in geometry!

The tasks at which they fail are ridiculously simple for human beings, including, for example:

* counting the number of times two lines intersect;

* detecting whether two circles overlap;

* selecting which letter is being circled in a word;

* counting the number of circles in an Olympic-like logo.

This should be at the top of the front page.

diwank 1 year ago |

I think while the findings are interesting, the title "Vision language models are blind" is hyperbolic and misrepresents the findings. It is true that VLMs have a very different way of sensing and working with image inputs and for low resolutions each image is split up into blocks and then mapped into discrete tokens. They really don't have access to fine-grained details because this mapping is quite lossy. So, in that sense, these results completely make sense and were not surprising to me, but "blind" is a strong implication and the work isn't really able to back that up.

Take the first example of the four 2D line plots for instance. I asked Sonnet 3.5 for an answer and retried it 5 times and 2 out of 5 times, it said something like this:

Me:

here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?

Write your answers as: Plot 1 (leftmost): ...

Sonnet:

I'll analyze each plot from left to right and provide the requested information:

Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0 intersections (the lines do not cross)

Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the right edge)

Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the center)

Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2 intersections (the lines cross twice, once near the left side and once near the right side)

Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.

---

Which isn't completely correct (the 3rd plot has 2 intersections) but still pretty good and the other 3 times, it said plot 3 has 1 intersection.

akavi 1 year ago |

Speaking as someone with only a tenuous grasp of how VLMs work, this naïvely feels like a place where the "embodiment" folks might have a point: Humans have the ability to "refine" their perception of an image iteratively, focusing in on areas of interest, while VLMs have to process the entire image at the same level of fidelity.

I'm curious if there'd be a way to emulate this (have the visual tokens be low fidelity at first, but allow the VLM to emit tokens that correspond to "focusing" on a region of the image with greater resolution). I'm not sure if/how it's possible to performantly train a model with "interactive" data like that, though.

poikroequ 1 year ago |

It's ironic, they fail these seemingly simple tests that are trivial even for a child to solve. Yet, I used Gemini to read a postcard containing handwritten Russian cursive text with lots of visual noise (postmarks and whatnot). It was able to read the text and translate it into English. I didn't even need to tell it the text is Russian.

On the one hand, it's incredible what these LLMs are capable of. On the other hand, they often fall flat on their face with seemingly simple problems like this. We are seeing the same from self driving cars, getting into accidents in scenarios that almost any human driver could have easily avoided.

slashdave 1 year ago | |

Simple for a child, yes. Because we have evolved our vision to recognize patterns like this, because they are important for survival. Reading Russian is not.

From an algorithmic point of view, these vision tasks are actually quite difficult to explicitly program.

mglz 1 year ago |

I taught some Computational Geometry courses and efficiently computing the intersections of N line segments is not as straightforward as you might initially think. Since somewhere some computation must be done to recognize this and LLMs are not specifically trained for this task, it's not surprising they struggle.

In general, basic geometry seems under-explored by learning.
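
For reference, here is a minimal brute-force sketch of counting crossings among N segments using the standard orientation test. It is not the efficient approach the comment alludes to (sweep-line methods such as Bentley-Ottmann avoid checking every pair); it checks all pairs in O(N^2) and only counts proper crossings, ignoring touching endpoints and collinear overlaps. The function names and sample segments are made up for illustration.

```python
from itertools import combinations

def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)

def segments_cross(a, b):
    """True if segments a and b cross at a single point interior to both."""
    (p1, p2), (p3, p4) = a, b
    # The endpoints of each segment must lie strictly on opposite sides of the other.
    return (orientation(p1, p2, p3) * orientation(p1, p2, p4) < 0
            and orientation(p3, p4, p1) * orientation(p3, p4, p2) < 0)

def count_crossings(segments):
    """Count crossing pairs by testing every pair of segments (O(N^2))."""
    return sum(segments_cross(a, b) for a, b in combinations(segments, 2))

# Hypothetical example: two diagonals of a square plus a horizontal line.
segments = [
    ((0, 0), (4, 4)),
    ((0, 4), (4, 0)),
    ((0, 1), (4, 1)),
]
print(count_crossings(segments))  # 3
```

Even this naive version needs explicit sign checks and special-case decisions, which is some indication of how much structure the task hides behind "just count the crossings".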

jordan_bonecut 1 year ago | |

Yes, but so is telling if a photo contains a dog or understanding sentiment in a paragraph of text. Complexity isn't quite the issue, I think it is that there is a distinction between the type of reasoning which these models have learnt and that which is necessary for concrete mathematical reasoning.

slashdave 1 year ago | | |

The models do not reason. They have learned associations, because these associations have appeared in their training sets.

samatman 1 year ago | |

> Since somewhere some computation must be done to recognize this

Humans don't have a "compute intersections" ability (other than a few who have learned it laboriously through algebra), we have a "see things and count them" mechanism. We aren't visually taking lines in a planar space and determining where they cross. We know what an intersection looks like, we see one, increment a counter, and find the next one. If it's less than around five, we do this all at once. Otherwise we literally count, sometimes in small groups, sometimes one at a time.

GaggiX 1 year ago |

Well, all the models (especially Claude 3.5 Sonnet) seem to perform much better than random, so they are clearly not blind. The only task where Claude Sonnet 3.5 does not perform better than random is the one where you have to follow many different paths (the ones where the answer from A to C is 3), something that would take me several seconds to solve.

I have the feeling that they first chose the title of the paper and then ran the evaluation on the new Claude 3.5 Sonnet on these abstract images.

>their vision is, at best, like that of a person with myopia seeing fine details as blurry

This also makes no sense, since the images evaluate the abstract capabilities of the models, not their eyesight.

randcraw 1 year ago | |

OK. They're legally blind.

GaggiX 1 year ago | | |

This really has nothing to do with vision impairment.

yantrams 1 year ago |

Tested these problems with llava-v1.6-mistral-7b and the results aren't bad. Maybe I just got lucky with these samples

Intersecting Lines https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc

Overlapping Circles https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384

Touching Circles https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc

Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc

Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm

simonw 1 year ago | |

These are really interesting examples, thanks for sharing.

yantrams 1 year ago | | |

You're welcome. I recently noticed I get better performance with VLMs when the queries are phrased this way - Descriptive Keys instead of explaining the problem in sentences. Similar to COT reasoning that many people claim gives better results, I personally found querying in this sequence - existenceOfEntity, numberOfEntities followed by propertiesOfEntities etc tends to give better results. I haven't verified any of this rigorously so please do take it with a pinch of salt :)
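
A rough sketch of what such a keyed query could look like when built programmatically. The three key names come from the comment above; the helper `build_keyed_prompt`, the surrounding wording, and the example entity are hypothetical, and nothing here calls a model or reproduces the exact prompts behind the linked runs.

```python
def build_keyed_prompt(entity, keys):
    """Return a prompt asking for one short answer per descriptive key, in the given order."""
    lines = [
        f"Answer about: {entity}",
        "Fill in each key in order, one per line:",
    ]
    lines += [f"{key}:" for key in keys]
    return "\n".join(lines)

# Key order suggested in the comment: existence first, then count, then properties.
keys = ["existenceOfEntity", "numberOfEntities", "propertiesOfEntities"]
prompt = build_keyed_prompt("circles in the image", keys)
print(prompt)
```

The resulting string would be sent alongside the image; whether this ordering actually helps is, as the comment says, unverified.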

taesiri 1 year ago |

This paper examines the limitations of current vision-based language models, such as GPT-4 and Sonnet 3.5, in performing low-level vision tasks. Despite their high scores on numerous multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?

simonw 1 year ago |

I've been frustrated at the general lack of analysis of vision LLMs.

They're clearly a very exciting category of technology, and a pretty recent one - they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.

There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?

r2_pilot 1 year ago | |

Why not use them yourself if you have access? I have been using Claude 3.5 Sonnet for gardening recently, and while it's not perfect (and can be a little blind unless you tell it to focus on a specific thing), it's helped me understand how to keep my plants alive in some challenging conditions (for me; this is my second or third attempt at gardening so it's all challenging lol). But just experiment with it and see where the capabilities lie. I do agree that certain classes of visual data are challenging for it.

simonw 1 year ago | | |

I've used them a bunch. I want to learn from other people's experiences as well.

Some of my notes so far:

- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - my datasette-extract plugin, for structured data from both text and images

- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - where they failed to extract data from a handwritten scanned document in various weird ways

- https://simonwillison.net/2024/Feb/21/gemini-pro-video/ talks about video inputs to Gemini Pro (which are actually image inputs, it splits them up to one frame per second)

simonw 1 year ago | |

Anthropic have some interesting cookbook examples that provide advice on using their multimodal models here: https://github.com/anthropics/anthropic-cookbook/tree/main/m...

I've assembled a bunch more notes here: https://simonwillison.net/tags/vision-llms/

dheera 1 year ago |

Current approaches of multi-modal models work on embeddings and tokenizations of images, which is the fundamental problem: you are feeding blurry, non-precise data into the model. Yes, they are blind because of exactly this.

An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.

At the moment compute power isn't good enough to feed high-res pixel data into these models, unless we discover a vastly different architecture, which I am also convinced likely exists.

visarga 1 year ago | |

> An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.

An embedding needs less words. You can embed individual words, phrases, like a whole prompt and longer paragraphs. You don't need 1024 words for a text embed. For example a famous library is called Sentence BERT (sbert).

When you embed images, on the other hand, you cut them up into little squares on the order of 32x32 px and embed each one of them separately. ChatGPT uses something like 250 tokens for smaller images. So a smaller image costs about as much as 200 words if represented graphically, and maybe far fewer words if you embed a text description of it.
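
As a rough illustration of that arithmetic, here is a minimal sketch that assumes one token per non-overlapping square patch. The 32 px patch size is the figure mentioned above; the function name is made up for this example, and real models differ in patch size, resizing and pooling.

```python
import math

def patch_token_count(width, height, patch=32):
    """Tokens needed if every patch x patch square becomes one token.

    Assumption: non-overlapping patches, partial patches at the edges
    are padded and still count as one token each.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)

print(patch_token_count(512, 512))  # 16 * 16 = 256
print(patch_token_count(224, 224))  # 7 * 7 = 49
```

Under these assumptions a 512x512 image lands close to the "something like 250 tokens" figure, but that match should not be read as a description of how any particular product counts tokens.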

dheera 1 year ago | | |

> needs less words

Yes I'm aware of this, and work in ML -- the thing is embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two images that have substantially similar CLIP (or whatever) embeddings that are visually very different. If you query the LLM about that difference, the LLM wouldn't even have the information to differentiate answers for the two images if you only supply it with the embedding.

On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from just an image embedding, since they are designed for image reconstruction, but I don't have access to the compute or data resources to attempt training this.
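
A toy version of that point, as a sketch only: the "embedding" below is just the mean brightness of each 2x2 block, which is an assumption made for illustration and has nothing to do with how CLIP is actually computed. It shows how two visually different images can collapse to the same vector, after which nothing downstream can tell them apart.

```python
def toy_embedding(image, block=2):
    """Toy stand-in for an encoder: mean brightness of each block x block tile."""
    n = len(image)
    emb = []
    for r in range(0, n, block):
        for c in range(0, n, block):
            cells = [image[i][j]
                     for i in range(r, r + block)
                     for j in range(c, c + block)]
            emb.append(sum(cells) / len(cells))
    return emb

# One long diagonal line.
diagonal = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
# Two short strokes slanted the other way.
flipped = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

print(diagonal != flipped)                                # True: different images
print(toy_embedding(diagonal) == toy_embedding(flipped))  # True: same embedding
```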

jayd16 1 year ago | |

Doesn't Gemini have a 2 million token limit for exactly this?

diwank 1 year ago | | |

The number of tokens per image is actually fairly small, ranging from 85 to ~500.

jeromeparadis 1 year ago |

One use-case I always try is to have an AI try to read a school calendar image where days off or days of interest are highlighted using a legend, i.e. days marked with a square, circle, or triangle, a different color, etc.

When asked for specific days of interest in the school year, AIs always struggle. They get some days right but forget some or make up new days. They fare a bit better if you remove some of the noise and give them only a picture of a single month, but even then, it's unreliable.

_vaporwave_ 1 year ago |

It's really interesting that there's a huge performance discrepancy between these SOTA models. In the Olympic logo example, GPT-4o is below the baseline accuracy of 20% (worse than randomly guessing) while Sonnet-3.5 was correct ~76% of the time.

Does anyone have any technical insight or intuition as to why this large variation exists?

ec109685 1 year ago | |

The question wasn’t “yes or no” but instead required an exact number: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...

Playing around with GPT-4o, it knows enough to make a copy of an image that is reasonable but it still can’t answer the questions.

ChatGPT went down a rabbit hole of trying to write python code, but it took lots of prompting for it to notice its mistake when solving one of the intersecting line questions.

pjs_ 1 year ago |

I don't like this paper for the following reasons:

- The language is unnecessarily scathing

- They repeatedly show data where the models are getting things right 70, 80, 90% of the time, and then show a list of what they call "qualitative samples" (what does "qualitative" mean? "cherry-picked"?) which look very bad. But it got the answer right 70/80/90% of the time! That's hardly "blind"...

- Various of the tasks hinge on the distinction between two objects "exactly touching" vs. "very nearly touching" vs. "very slightly overlapping", a problem which (i) is hard for humans and (ii) is particularly (presumably deliberately) sensitive to resolution/precision, where we should not be surprised that models fail

- The main fish-shaped example given in task 1 seems genuinely ambiguous to me - do the lines "intersect" once or twice? The tail of the fish clearly has a crossing, but the nose of the fish seems a bit fishy to me... is that really an intersection?

- AFAIC deranged skepticism is just as bad as deranged hype, the framing here is at risk of appealing to the former

It's absolutely fair to make the point that these models are not perfect, fail a bunch of the time, and to point out the edge cases where they suck. That moves the field forwards. But the hyperbole (as pointed out by another commenter) is very annoying.

gnutrino 1 year ago |

My guess is that the systems are running image recognition models, and maybe OCR on images, and then just piping that data as tokens into an LLM. So you are only ever going to get results as good as existing image models with the results filtered through an LLM.

To me, this is only interesting if compared with results of image recognition models that can already answer these types of questions (if they even exist, I haven't looked).

Maybe the service is smart enough to look at the question, and then choose one or more models to process the image, but not sure as I can't find anything on their sites about how it works.

simonw 1 year ago | |

That's not how they work. The original GPT-4 paper has some detail: https://cdn.openai.com/papers/gpt-4.pdf

Or read up on PaliGemma: https://github.com/google-research/big_vision/blob/main/big_...

gnutrino 1 year ago | | |

Thanks, I'll read up on this.

Eisenstein 1 year ago | |

> My guess is that the systems are running image recognition models

Your guess is incorrect. Look up CLIP, BLIP, and SigLip for an idea of how they work.

gnutrino 1 year ago | | |

Will do, thank you.

jordan_bonecut 1 year ago |

This is an interesting article and goes along with how I understand how such models interpret input data. I'm not sure I would characterize the results as blurry vision, but maybe an inability to process what they see in a concrete manner.

All the LLMs and multi-modal models I've seen lack concrete reasoning. For instance, ask ChatGPT to perform 2 tasks, to summarize a chunk of text and to count how many words are in this chunk. ChatGPT will do a very good job summarizing the text and an awful job at counting the words. ChatGPT and all the transformer based models I've seen fail at similar concrete/mathematical reasoning tasks. This is the core problem of creating AGI and it generally seems like no one has made any progress towards synthesizing something with both a high and low level of intelligence.

My (unproven and probably incorrect) theory is that under the hood these networks lack information processing loops which make recursive tasks, like solving a math problem, very difficult.

scarface_74 1 year ago | |

Out of curiosity, I tried your test with ChatGPT 4o

https://chatgpt.com/share/79c5c6e1-e6a9-441b-acb3-54882303a8...

Of course as usual, LLMs are horrible with Math.

Funny enough, the next time it verified the word count by counting it out until I specifically told it to use Python

https://chatgpt.com/share/79e7b922-9b0f-4df9-98d0-2cd72d7041...
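
For reference, the kind of Python this comes down to is tiny. A minimal sketch, assuming "word" simply means a whitespace-separated chunk (the function name is made up here, and the linked chat may have used something different):

```python
def count_words(text):
    """Count whitespace-separated chunks; punctuation stays attached to words."""
    return len(text.split())

print(count_words("Vision language models are blind"))  # 5
```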

infiar 1 year ago | | |

This counting words task reminded me of a youtube video: https://www.youtube.com/watch?v=-9XKiOXaHlI Maybe LLMs are somehow more like monkeys.

mr_toad 1 year ago | |

Counting is hard, even for humans.

A child will start to speak at around the age of one, but most will be about two before they start to count. And it is even longer (maybe the age of three to four) before they understand cardinality and can reliably follow “simple” instructions like “bring me four blocks”.

And basic arithmetic without counting on their fingers is usually not picked up until they are around six or seven.

Rebuff5007 1 year ago |

In fairness, Mira Murati said GPT-4 is only high school level [1]. Maybe it takes PhD level to understand basic shapes?

[1] https://www.ccn.com/news/technology/openais-gpt-5-phd-level-...

londons_explore 1 year ago |

Could some of the "wrong" answers be the LLM attempting to give an explanation rather than the answer, e.g. instead of answering 'X', the LLM answers 'The letter is partially hidden by the oval, so cannot be certain, but it appears to be the english letter X'?

The scoring criteria would rank this answer as 'T', which is wrong.
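
To show how that could happen, here is a sketch with two hypothetical extraction rules. Neither is taken from the paper (I don't know what its scoring code actually does); both function names are invented for illustration.

```python
import re

response = ("The letter is partially hidden by the oval, so cannot be certain, "
            "but it appears to be the english letter X")

def naive_extract(text):
    """Hypothetical scorer: take the first uppercase character in the response."""
    for ch in text:
        if ch.isupper():
            return ch
    return None

def standalone_extract(text):
    """Hypothetical alternative: take the last capital letter that is a word by itself."""
    matches = re.findall(r"\b[A-Z]\b", text)
    return matches[-1] if matches else None

print(naive_extract(response))       # T
print(standalone_extract(response))  # X
```

The second rule is not robust either: a response containing a standalone "I" or "A" after the real answer would fool it.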

orbital-decay 1 year ago |

That's not anything like "myopia", though.

FWIW I tried the line intersection and the circled letter test from the article with CogVLM (which is far from reaching the current SotA) and it correctly passed both. I haven't tried it with Sonnet/4o but I suspect there might be something wrong with how the author did their tests. Don't get me wrong, but too many "the model can't do that" claims ended up with demonstrations of the model doing exactly that...

Log_out_ 1 year ago |

Chat gpt write me an argument that humans are blind because https://en.m.wikipedia.org/wiki/Optical_illusion exist.

Alexa experience that tragic irony for me.

Siri, forget it.

randomtree 1 year ago |

I guess I know what's coming to every captcha tomorrow.

rezaghanbari1 1 year ago |

Some of these samples are shocking. How do these models answer chart-based questions, I mean when they can't even count the intersections between two lines?

RodgerTheGreat 1 year ago | |

Same way they answer any question: piece together a statistically probable sequence of words to follow the prompt. All they know about an image is a handful of words a classifier might choose to describe it. If those words have nothing to do with the question being asked, they can't nudge the model in the general direction of a correct answer, so it's a crapshoot- even moreso than usual.

joefourier 1 year ago | | |

That’s not at all how multi-modal LLMs work - their visual input is not words generated by a classifier. Instead the image is divided into patches and tokenised by a visual encoder (essentially, it is compressed), and then fed directly as a sequence to the model.

imtringued 1 year ago | |

The dataset most likely contains chart descriptions that describe the raw data, but not the visual interactions of the individual pixels.

nichohel 1 year ago |

Vision language models are blind because they lack the Cartesian Theater, which you and I have. Which you and I say we have.

mr_toad 1 year ago | |

May as well argue that they can’t really know things because they lack an immortal soul.

codeulike 1 year ago | |

Does the part of you that 'looks at' your cartesian theatre also have a cartesian theatre?

fleshmonad 1 year ago | |

[citation needed]

aaroninsf 1 year ago |

The title for this page and argument should be qualified with the specific generation of tools.

That's in the abstract, but, it's bad to not be specific. In this case, because current public-facing models are WIWEB: the worst it will ever be.

And there are trillion-dollar prizes at stake, so, improvement is happening as quickly as it possibly can.

Jack000 1 year ago |

This is kind of the visual equivalent of asking an LLM to count letters. The failure is more related to the tokenization scheme than the underlying quality of the model.

I'm not certain about the specific models tested, but some VLMs just embed the image modality into a single vector, making these tasks literally impossible to solve.

axblount 1 year ago |

Would you say they have Blindsight?

michaelhoney 1 year ago |

This says to me that there are huge opportunities for improvement in providing vision modules for LLMs. Human minds aren't made of just one kind of thing: we have all sorts of hacky modular capabilities – there's no reason to think that a future AGI wouldn't also.

tantalor 1 year ago |

Are the "random-baseline accuracy" numbers correct?

In the "Two circles" test, do they really have 50% chance of overlapping? I think this comes from "Distances between circle perimeters: -0.15 to 0.5 times the diameter" but doesn't say the distribution they use.

jdlshore 1 year ago | |

They asked the AI a question with a yes/no response. If the AI chose randomly, it would be correct 50% of the time. That’s what “random baseline accuracy” means.
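
To separate the two numbers being discussed: the random baseline depends only on how many answer choices there are, while the share of images that actually overlap depends on how the gaps were sampled. A sketch, assuming (this is not stated in the quoted text) that gaps are drawn uniformly from the quoted range of -0.15 to 0.5 diameters:

```python
def random_baseline(num_choices):
    """Expected accuracy of picking uniformly among the answer choices."""
    return 1 / num_choices

def overlap_fraction(lo=-0.15, hi=0.5):
    """Share of gaps below zero if gaps are uniform on [lo, hi] (an assumption)."""
    return (0 - lo) / (hi - lo)

print(random_baseline(2))            # 0.5
print(round(overlap_fraction(), 3))  # 0.231
```

Under that uniform assumption, a model that always answered "not overlapping" would be right about 77% of the time, which is a different baseline from the 50% coin flip.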

viraptor 1 year ago |

I love some of the interpretations there. For example "Fig. 10: Only Sonnet-3.5 can count the squares in a majority of the images.", when that model simply returns "4" for every question and happens to be right.

vessenes 1 year ago |

A few comments below talk about how tokenizing images using stuff like CLIP de-facto yields blurry image descriptions, and so these are ‘blind’ by some definitions. Another angle of blurring not much discussed is that the images are rescaled down; different resolutions for different models. I wouldn’t be surprised if Sonnet 3.5 had a higher res base image it feeds in to the model.

Either way, I would guess that we’ll need new model architectures for multimodal to get really good at some of this, and even then some of these tasks are adjacent to things that we know LLMs are already bad at (numeric logic, for instance).

As context lengths get longer, devoting more tokens to the image tokenization should help a bit here as well. Anyway, I’d anticipate next year we’d see 80s and 90s for most of these scores with next gen models.

imtringued 1 year ago | |

The problem with the current crop of projectors such as LLaVA is that, as far as I know, they do not take the previous conversation into account. You only really get zero shot responses. This means that you cannot steer the model towards paying attention to specific instruction related details. The projector simply creates a token representation of the visuals (not necessarily human language tokens) and the LLM just processes that as usual.

vessenes 1 year ago | | |

The original gpt4 did this too, it had almost no memory before or after the image provided. I haven’t tested gpt4o on this directly, but my feeling is that it’s better from casual usage.

I do think some of these thin line drawings are likely extra hard to tokenize depending on the image scaling sizes for tokenization. I’d wager thicker lines would help, although obviously not all of this is just ‘poor tokenization’.
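
A small sketch of why line thickness could matter when an image is scaled down. It assumes a plain box filter (averaging each 4x4 block), which is only one of many resampling methods and may not be what any of these models use; the helper names are made up.

```python
def vertical_line(size, col, thickness):
    """size x size image, background 0.0, one vertical line of value 1.0."""
    return [[1.0 if col <= x < col + thickness else 0.0 for x in range(size)]
            for _ in range(size)]

def box_downsample(image, factor):
    """Shrink by averaging each factor x factor block into one pixel."""
    n = len(image) // factor
    out = []
    for r in range(n):
        row = []
        for c in range(n):
            cells = [image[r * factor + i][c * factor + j]
                     for i in range(factor)
                     for j in range(factor)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out

thin = box_downsample(vertical_line(8, 1, 1), 4)
thick = box_downsample(vertical_line(8, 1, 2), 4)

print(max(max(row) for row in thin))   # 0.25
print(max(max(row) for row in thick))  # 0.5
```

The 1 px line keeps a quarter of its contrast after a 4x reduction and the 2 px line keeps half, so thin strokes fade towards the background first.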

ec109685 1 year ago | |

At least for gpt 4o, it can create a facsimile of images that it still can’t analyze properly, so I think it’s more than just its “eyes” that are broken.

It clearly wasn’t trained on this task and suffers accordingly.

However, with chatgpt, it will create python to do the analysis and has better results.
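
For the intersecting-lines questions, the Python route looks roughly like the sketch below. It assumes the point coordinates are already known (extracting them from the image is the hard part and is not shown), and the example coordinates are made up.

```python
def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, 0 or -1."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a single interior point.

    Touching at an endpoint and collinear overlap are deliberately not counted.
    """
    return (orient(a, b, c) * orient(a, b, d) < 0
            and orient(c, d, a) * orient(c, d, b) < 0)

def count_crossings(line1, line2):
    """Count crossings between two polylines given as lists of (x, y) points."""
    return sum(
        segments_cross(line1[i], line1[i + 1], line2[j], line2[j + 1])
        for i in range(len(line1) - 1)
        for j in range(len(line2) - 1)
    )

red = [(0, 0), (1, 2), (2, 0)]   # goes up, then down
blue = [(0, 1), (1, 1), (2, 1)]  # flat line at y = 1
print(count_crossings(red, blue))  # 2
```

A crossing that falls exactly on a shared vertex is not counted here, which is the same kind of edge case that makes some of the benchmark images ambiguous.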

iamleppert 1 year ago |

This could easily be fixed with training and fine tuning. Simply generate 100,000 examples or so, and train with ground truth for however long you want and it's a solved problem.
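
Generating the labelled examples is the easy part. A minimal sketch for the two-circles task, assuming gaps sampled uniformly from the -0.15 to 0.5 diameter range quoted elsewhere in this thread; the field names are made up, and rendering the images and the training itself are left out:

```python
import random

def make_example(rng, diameter=1.0):
    """One synthetic two-circles example with its ground-truth label."""
    gap = rng.uniform(-0.15, 0.5) * diameter  # distance between the perimeters
    return {"diameter": diameter, "gap": gap, "overlapping": gap < 0}

def make_dataset(n, seed=0):
    """n examples from a seeded generator so the dataset is reproducible."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]

data = make_dataset(100_000)
print(len(data))  # 100000
```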

kristjansson 1 year ago | |

Solved for this benchmark... and at what cost to the rest of the system?

These tasks are interesting because they're existence proofs of generalization failure. Like the haystack problem, direct solutions here are much less interesting than structural improvements that address the class of failure.

imtringued 1 year ago | |

Ok, but most of the data is just captions for images. You're going to have to invest some time into building this dataset at your own expense.

make3 1 year ago |

Hugged to death from my perspective. Here is a backup: https://archive.ph/kOE3Q

simonw 1 year ago | |

That's weird - GitHub Pages serves static content and rarely (in my experience) fails to load.

kristianpaul 1 year ago |

We see through thoughts and memories. We see when we desire, the vision just adds on a world of thoughts and consciousness of being conscious.

Vision links thoughts with reality

childintime 1 year ago |

Claude 3.5 does remarkably well though on many tasks, compared to the others, and on those it's not at all blind. It's getting there.

navaed01 1 year ago |

Is there a good primer on how these vision LLMs work?

nyxtom 1 year ago |

I wonder how well Alpha Geometry would do on this

nybsjytm 1 year ago | |

AlphaGeometry is a hyper-specific system trained to add auxiliary geometric objects, like extra lines, to existing Euclidean geometry configurations. These prompts are not even sensible inputs to AlphaGeometry.

jackblemming 1 year ago |

Ask it to draw any of those things and it can.

cpill 1 year ago |

I wonder how they would score if they used all 4 models and took a majority vote...?
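
The voting part would be simple to try. A sketch, with made-up answers standing in for the four models' outputs and a tie rule (return None) that is just one possible choice:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None if the list is empty or the top two tie."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

print(majority_vote(["2", "2", "1", "2"]))  # 2
print(majority_vote(["1", "2", "1", "2"]))  # None
```

With four voters, 2-2 ties are possible, so the result depends on how ties are broken. And if the models tend to make the same mistakes, voting would not help much.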

nmca 1 year ago |

please use this opportunity to reflect on whether ARC measures reasoning skills :)

nothrowaways 1 year ago |

The next version will solve all of it.

mkoubaa 1 year ago |

They interact with pixel buffers as a mathematical array. To call them blind is to confuse what they are doing with the experience of sight...

codeulike 1 year ago | |

Humans 'see' by tightly packed rods and cones in the retina sending signals up the optic nerve. Not as tidy as a mathematical array but nonetheless not all that different. Ultimately what comes to the brain from the retina can be thought of as a data structure of sorts.

verbalstoner 1 year ago |

It's virtually impossible to take a paper seriously when the title has an emoji.

spullara 1 year ago |

in other news, vision models are bad at things they aren't trained to do

hi_dang_ 1 year ago |

I was hoping that someone in the comments talking the paper down would have published a paper or have had relevant publications of their own to point to. You know, meet the lads halfway sort of thing.

So what I’m left with to judge instead is anonymous online commenters vs. the publication of 2 prestigious universities. Whose word do I take on this? Decisions, decisions.

You can swap LM out for Web3, NFT, or Crypto in this case.

warkdarrior 1 year ago | |

> I’m left with [...] is anonymous online commenters vs. the publication of 2 prestigious universities. Whose word do I take on this?

Maybe you need to judge the contents of those online comments and the contents of the publication, instead of relying on argument from authority.