Vision language models are blind (vlmsareblind.github.io)
That's in the abstract, but it's bad to not be specific. In this case, because current public-facing models are WIWEB: the worst it will ever be.
And there are trillion-dollar prizes at stake, so improvement is happening as quickly as it possibly can.
I'm not certain about the specific models tested, but some VLMs just embed the image modality into a single vector, making these tasks literally impossible to solve.
In the "Two circles" test, do they really have 50% chance of overlapping? I think this comes from "Distances between circle perimeters: -0.15 to 0.5 times the diameter" but doesn't say the distribution they use.
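For what it's worth, here is a back-of-the-envelope check, assuming (and the page doesn't say this) that the gap is drawn uniformly from that range. `overlap_fraction` is a made-up helper name, and a negative gap is taken to mean the circles overlap.

```python
def overlap_fraction(low=-0.15, high=0.5):
    """Share of a uniform [low, high] gap distribution that is negative.

    The gap is the distance between perimeters in units of the diameter;
    a negative gap is taken to mean the two circles overlap.
    The uniform distribution is an assumption, not something the page states.
    """
    if not low < 0 < high:
        raise ValueError("expected low < 0 < high")
    return -low / (high - low)


print(round(overlap_fraction(), 3))  # 0.231
```

Under that assumption only about 23% of the pairs would overlap, so a 50/50 split would need a different (or discretized) sampling scheme.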
Either way, I would guess that we’ll need new model architectures for multimodal to get really good at some of this, and even then some of these tasks are adjacent to things that we know LLMs are already bad at (numeric logic, for instance).
As context lengths get longer, devoting more tokens to the image tokenization should help a bit here as well. Anyway, I’d anticipate next year we’d see 80s and 90s for most of these scores with next gen models.
I do think some of these thin line drawings are likely extra hard to tokenize depending on the image scaling sizes for tokenization. I’d wager thicker lines would help, although obviously not all of this is just ‘poor tokenization’.
It clearly wasn’t trained on this task and suffers accordingly.
However, ChatGPT will write Python to do the analysis, and it gets better results that way.
These tasks are interesting because they're existence proofs of generalization failure. Like the haystack problem, direct solutions here are much less interesting than structural improvements that address the class of failure.
Vision links thoughts with reality
So what I’m left with to judge instead is anonymous online commenters vs. the publication of 2 prestigious universities. Whose word do I take on this? Decisions, decisions.
You can swap LM out for Web3, NFT, or Crypto in this case.
Maybe you need to judge the contents of those online comments and the contents of the publication, instead of relying on argument from authority.
> their vision is, at best, like that of a person with myopia seeing fine details as blurry
is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things, assuming that they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks and the authors have not presented any real evidence about these two modes.
This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.
"Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro are powering countless image-text processing applications"
I don't know how many a "countless" is, but I think we've gotten really sloppy in terms of what counts for LLMs as a demonstrated, durable win in a concrete task attached to well-measured outcomes and holding up over even modest periods of time.
This stuff is really promising and lots of builders are making lots of nifty things, so if that counts as an application then maybe we're at countless, but in the enterprise and in government and in refereed academic literature we seem to be at the proof-of-concept phase. Impressive chat bots as a use case are pretty dialed in, enough people claim that they help with coding that I tend to believe it's a real thing (I never seem to come out ahead of going directly to the source, StackOverflow).
The amount of breathless press on this seems "countless", so maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it.
You see an object. First you have to learn how to control all your body functions to move toward it and grasp it. This teaches you about the 3 dimensional world and things like gravity. You may not know the terms, but it is baked in your learning model. After you get an object you start building a classification list "hot", "sharp", "soft and fuzzy", "tasty", "slick". Your learning model builds up a list of properties of objects and "expected" properties of objects.
Once you have this 'database' you create as a human, you can apply the logic to achieve tasks. "Walk 10 feet forward, but avoid the sharp glass just to the left". You have to have spatial awareness, object awareness, and prediction ability.
Models 'kind of' have this, but it's seemingly haphazard, kind of like a child that doesn't know how to put all the pieces together yet. I think a lot of embodied robot testing where the embodied model feeds back training to the LLM/vision model will have to occur before this is even somewhat close to reliable.
With AI we are creating intelligence but with different strengths and weaknesses. I think we will continue to be surprised at how well they work on some problems and how poorly they do at some “simple” ones.
“These new tools aren’t perfect” is the dog bites man story of technology. It’s certainly true, but it’s no different than GPS (“family drives car off cliff because GPS said to”).
They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.
I don't think that models have been portrayed as equivalent to humans. Like most AI, it has been shown as vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portrayals by the AI field in general.
Let's just ignore what a blind person normally needs to know.
You know what blind people ask? Sometimes their daily routine is broken because there is some type of construction, and models can tell you this.
Sometimes they need to read a basic sign and models can do this.
Those models help people already and they will continue to get better.
I'm not sure if I'm more frustrated by how condescending the authors are or by your ignorance.
Valid criticism doesn't need to be shitty
Like, seriously. One poster below whines about "AI apologists" and BeMyEyes, but again, it's all trivially testable with your phone and $20/month subscription. It works spectacularly well on real world tasks. Not perfectly, sure, but good enough to be useful in practice and better than alternatives (which often don't exist).
It's not that far from reality: most models see images in very low resolution with limited colors, so the description isn't far off.
First of all myopia isn't 'seeing fine details as blurry' - it's nearsightedness - and whatever else this post tested it definitely didn't test depth perception.
And second - inability to see fine details is a distinct/different thing from not being able to count intersections and the other things tested here. That hypothesis, if valid, would imply that improving the resolution of the image that the model can process would improve its performance on these tasks even if reasoning abilities were the same. That - does not make sense. Plenty of the details in these images that these models are tripping up on are perfectly distinguishable at low resolutions. Counting rows and columns of blank grids is not going to improve with more resolution.
I mean, I'd argue that the phrasing of the hypothesis ("At best, like that of a person with myopia") doesn't make sense at all. I don't think a person with myopia would have any trouble with these tasks if you zoomed into the relevant area, or held the image close. I have a very strong feeling that these models would continue to suffer on these tasks if you zoomed in. Nearsighted != unable to count squares.
Is this the sales pitch though? Because 15 years ago, I had a scanner with an app that could scan a text document and produce the text on Windows. The machine had something like 256 MB of RAM.
Tech can be extremely good at niches in isolation. You can have an OCR system 10 years ago and it'll be extremely reliable at the single task it's configured to do.
AI is supposed to bring a new paradigm, where the tech is not limited to the specific niche the developers have scoped it to. However, if it reliably fails to detect simple things a regular person should not get wrong, then the whole value proposition is kicked out of the window.
I like the idea that these models are so good at some sort of specific and secret bit of visual processing that things like “counting shapes” and “beating a coin toss for accuracy” shouldn’t be considered when evaluating them.
I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis. Sure, they could have used a better analogy and they could have done other tests, but the paper still counters quite common claims (from researchers) about VLMs.
> I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
I find it hard to believe that there is no example you can give. It surely doesn't have to be exactly your training data. If it is this good, surely you can create an example no problem. If you just don't want to, that's okay, but then don't say it.
But I have further questions. Do you have complicated prompting? Or any prompt engineering? It sure does matter how robust these models are to prompting. There's a huge difference between a model being able to accomplish a task and a model being able to perform a task outside a very specific environment. This is no different than something working in a tech demo and not in the hands of the user.
> But in practice, we aren't just making up tasks to trip up these models.
I see this sentiment quite often and it is baffling to me.
First off, these tasks are not clearly designed to trick these models. A task a model fails at is not suddenly "designed to trick a model." It's common with the river crossing puzzles where they're rewritten to be like "all animals can fit in the boat." If that is "designed to trick a model", then the model must be a stochastic parrot and not a generalist. It is very important that we test things where we do know the answer, because, unfortunately, we're not clairvoyant and can't test questions we don't know the answer to, which is the common case in real-world usage.
Second, so what if a test was designed to trip up a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly? This seems doubly important if they are tasks that humans don't find challenging.
> They can be very performant on some tasks and the authors have not presented any real evidence about these two modes.
I don't think people are claiming that large models can't be performant on some tasks. If they are, they're rejecting trivially verifiable reality. But not every criticism has to also contain positive points. There's plenty of papers and a lot of hype already doing that. And if we're going to be critical of anything, shouldn't it be that the companies creating these models -- selling them, and even charging researchers to perform these types of experiments that can be and are used to improve their products -- should be much more clear about the limitations of their models? If we need balance, then I think there's bigger fish to fry than Auburn and Alberta Universities.
People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.
The real world outside is designed to trip up the model. Strange things happen all the time.
Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.
I don't know, words have meanings. If that's a communication issue, it's on the part of the authors. To me, this wording in what is supposed to be a research paper abstract clearly suggests insufficient resolution as the cause. How else should I interpret it?
> The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry
And indeed, increasing the resolution is expensive, and the best VLMs have something like 1000x1000. But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.
>I find it hard to believe that there is no example you can give.
I'm not the person you're answering to, but I actually lazily tried two of authors' examples in a less performant VLM (CogVLM), and was surprised it passed those, making me wonder whether I can trust their conclusions until I reproduce their results. LLMs and VLMs have all kinds of weird failure modes, it's not a secret they fail at some trivial tasks and their behavior is still not well understood. But working with these models and narrowing it down is notoriously like trying to nail a jelly to the wall. If I was able to do this in a cursory check, what else is there? More than one research paper in this area is wrong from the start.
I think they just train it on a bunch of text.
Maybe counting squares in a grid was not considered important enough to train for.
What was really remarkable though was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation.
Screenshot: https://imgur.com/a/QqCNzOM
After you fix it, you should try asking the same questions!
Like my sister post, I’m skeptical that the LLM didn’t just get lucky.
There are ways to help the VLM out - Set of Marks [0] from Microsoft being the most prominent, which uses segmentation to outline and label sections of the image before sending to the VLM.
Providing "speakable" labels to regions helps ground the visual abilities of VLMs and is why in this paper the performance is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid"
It doesn’t seem to have the logic though to answer these questions.
The complete data set is here to play around with it yourself: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...
The tasks at which they fail are ridiculously simple for human beings, including, for example:
* counting the number of times two lines intersect;
* detecting whether two circles overlap;
* selecting which letter is being circled in a word;
* counting the number of circles in an Olympic-like logo.
This should be at the top of the front page.
Take the first example of the four 2D line plots for instance. I asked Sonnet 3.5 for an answer and retried it 5 times and 2 out of 5 times, it said something like this:
Me:
here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?
Write your answers as: Plot 1 (leftmost): ...
Sonnet:
I'll analyze each plot from left to right and provide the requested information:
Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0 intersections (the lines do not cross)
Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the right edge)
Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the center)
Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2 intersections (the lines cross twice, once near the left side and once near the right side)
Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.
---
Which isn't completely correct (the 3rd plot has 2 intersections) but still pretty good and the other 3 times, it said plot 3 has 1 intersection.
I'm curious if there'd be a way to emulate this (have the visual tokens be low fidelity at first, but allow the VLM to emit tokens that correspond to "focusing" on a region of the image with greater resolution). I'm not sure if/how it's possible to performantly train a model with "interactive" data like that, though
On the one hand, it's incredible what these LLMs are capable of. On the other hand, they often fall flat on their face with seemingly simple problems like this. We are seeing the same from self driving cars, getting into accidents in scenarios that almost any human driver could have easily avoided.
From an algorithmic point of view, these vision tasks are actually quite difficult to explicitly program.
In general, basic geometry seems under-explored by learning.
Humans don't have a "compute intersections" ability (other than a few who have learned it laboriously through algebra), we have a "see things and count them" mechanism. We aren't visually taking lines in a planar space and determining where they cross. We know what an intersection looks like, we see one, increment a counter, and find the next one. If it's less than around five, we do this all at once. Otherwise we literally count, sometimes in small groups, sometimes one at a time.
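For contrast, here is a minimal sketch of the "compute intersections" route, assuming you already have each line as a list of (x, y) vertices rather than pixels (which is exactly what a VLM does not get). The function names are made up for illustration, and crossings that land exactly on a vertex or endpoint are deliberately not counted.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 cross at a point interior to both."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def count_crossings(line_a, line_b):
    """Count proper crossings between two polylines given as vertex lists."""
    return sum(
        segments_cross(a1, a2, b1, b2)
        for a1, a2 in zip(line_a, line_a[1:])
        for b1, b2 in zip(line_b, line_b[1:])
    )


# A tent-shaped line against a flat one: they cross on the way up and on the way down.
red = [(0, 0), (1, 2), (2, 0)]
blue = [(0, 1), (1, 1), (2, 1)]
print(count_crossings(red, blue))  # 2
```

With coordinates in hand this is a few lines of algebra; the hard part is recovering those coordinates from a raster image in the first place.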
I have the feeling that they first choose the title of the paper and then run the evaluation on the new Claude 3.5 Sonnet on these abstract images.
>their vision is, at best, like that of a person with myopia seeing fine details as blurry
This also makes no sense, since the images evaluate the abstract capabilities of the models, not their eyesight.
Intersecting Lines https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc
Overlapping Circles https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384
Touching Circles https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc
Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc
Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm
They're clearly a very exciting category of technology, and a pretty recent one - they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.
There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?
Some of my notes so far:
- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - my datasette-extract plugin, for structured data from both text and images
- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism... - where they failed to extract data from a handwritten scanned document in various weird ways
- https://simonwillison.net/2024/Feb/21/gemini-pro-video/ talks about video inputs to Gemini Pro (which are actually image inputs, it splits them up to one frame per second)
I've assembled a bunch more notes here: https://simonwillison.net/tags/vision-llms/
An embedding isn't conceptually that much different from feeding a 1024-word description of an image instead of the actual image.
At the moment compute power isn't good enough to feed high-res pixel data into these models, unless we discover a vastly different architecture, which I am also convinced likely exists.
An embedding needs fewer words. You can embed individual words, phrases, a whole prompt, or longer paragraphs. You don't need 1024 words for a text embed. For example, a famous library is called Sentence-BERT (sbert).
When you embed images on the other hand, you cut them up into little squares on the order of 32x32 px and embed each of them separately. ChatGPT uses something like 250 tokens for smaller images. So a smaller image costs about as much as 200 words if represented graphically, and maybe far fewer words if you embed a text description of it.
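As a rough illustration of that arithmetic, here is a sketch assuming a plain grid of non-overlapping square patches with one token each. Real models also resize, tile, or pool, and the 32 px figure is just the number from above, not a confirmed spec for any model; `patch_token_count` is a made-up name.

```python
import math


def patch_token_count(width, height, patch=32):
    """Tokens needed if every patch x patch square becomes exactly one token.

    Partial patches at the right and bottom edges are rounded up, as if padded.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


print(patch_token_count(512, 512))  # 256
```

A 512x512 image comes out at 256 tokens under these assumptions, which is in the same ballpark as the "something like 250 tokens" figure.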
Yes I'm aware of this, and work in ML -- the thing is embeddings are not designed for faithful image reconstruction, and aren't even trained that way. You can easily find two images that have substantially similar CLIP (or whatever) embeddings that are visually very different. If you query the LLM about that difference, the LLM wouldn't even have the information to differentiate answers for the two images if you only supply it with the embedding.
On the other hand, SDXL autoencoder latents passed into an LLM alongside the embedding might be a step up from just an image embedding, since they are designed for image reconstruction, but I don't have access to the compute or data resources to attempt training this.
When asked for specific days of interest in the school year, AIs always struggle. They get some days right but forget some or make up new ones. They fare a bit better if you remove some of the noise and give them only a picture of a month, but even then it's unreliable.
Does anyone have any technical insight or intuition as to why this large variation exists?
Playing around with GPT-4o, it knows enough to make a copy of an image that is reasonable but it still can’t answer the questions.
ChatGPT went down a rabbit hole of trying to write Python code, but it took lots of prompting for it to notice its mistake when solving one of the intersecting line questions.
- The language is unnecessarily scathing
- They repeatedly show data where the models are getting things right 70, 80, 90% of the time, and then show a list of what they call "qualitative samples" (what does "qualitative" mean? "cherry-picked"?) which look very bad. But it got the answer right 70/80/90% of the time! That's hardly "blind"...
- Various of the tasks hinge on the distinction between two objects "exactly touching" vs. "very nearly touching" vs. "very slightly overlapping", a problem which (i) is hard for humans and (ii) is particularly (presumably deliberately) sensitive to resolution/precision, where we should not be surprised that models fail
- The main fish-shaped example given in task 1 seems genuinely ambiguous to me - do the lines "intersect" once or twice? The tail of the fish clearly has a crossing, but the nose of the fish seems a bit fishy to me... is that really an intersection?
- AFAIC deranged skepticism is just as bad as deranged hype, the framing here is at risk of appealing to the former
It's absolutely fair to make the point that these models are not perfect, fail a bunch of the time, and to point out the edge cases where they suck. That moves the field forwards. But the hyperbole (as pointed out by another commenter) is very annoying.
To me, this is only interesting if compared with results of image recognition models that can already answer these types of questions (if they even exist, I haven't looked).
Maybe the service is smart enough to look at the question, and then choose one or more models to process the image, but not sure as I can't find anything on their sites about how it works.
Or read up on PaliGemma: https://github.com/google-research/big_vision/blob/main/big_...
Your guess is incorrect. Look up CLIP, BLIP, and SigLip for an idea of how they work.
All the LLMs and multi-modal models I've seen lack concrete reasoning. For instance, ask ChatGPT to perform 2 tasks, to summarize a chunk of text and to count how many words are in this chunk. ChatGPT will do a very good job summarizing the text and an awful job at counting the words. ChatGPT and all the transformer based models I've seen fail at similar concrete/mathematical reasoning tasks. This is the core problem of creating AGI and it generally seems like no one has made any progress towards synthesizing something with both a high and low level of intelligence.
My (unproven and probably incorrect) theory is that under the hood these networks lack information processing loops which make recursive tasks, like solving a math problem, very difficult.
https://chatgpt.com/share/79c5c6e1-e6a9-441b-acb3-54882303a8...
Of course as usual, LLMs are horrible with Math.
Funny enough, the next time it verified the word count by counting it out until I specifically told it to use Python
https://chatgpt.com/share/79e7b922-9b0f-4df9-98d0-2cd72d7041...
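For reference, the Python route is a one-liner. This sketch assumes "word" just means a whitespace-separated chunk, which is only one of several reasonable definitions; `count_words` is a name picked for illustration.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated chunks; punctuation stays attached to its word."""
    return len(text.split())


print(count_words("The quick brown fox jumps over the lazy dog"))  # 9
```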
A child will start to speak at around the age of one, but most will be about two before they start to count. And it is even longer (maybe the age of three to four) before they understand cardinality and can reliably follow “simple” instructions like “bring me four blocks”.
And basic arithmetic without counting on their fingers is usually not picked up until they are around six or seven.
[1] https://www.ccn.com/news/technology/openais-gpt-5-phd-level-...
The scoring criteria would rank this answer as 'T', which is wrong.
FWIW I tried the line intersection and the circled letter test from the article with CogVLM (which is far from reaching the current SotA) and it correctly passed both. I haven't tried it with Sonnet/4o but I suspect there might be something wrong with how the author did their tests. Don't get me wrong, but too many "the model can't do that" claims ended up with demonstrations of the model doing exactly that...
Alexa experience that tragic irony for me.
Siri. Forget it.
[0] https://distributedmuseum.illinois.edu/exhibit/biological_co...
[1] https://sites.evergreen.edu/arunchandra/wp-content/uploads/s...
Another benefit of VLMs which support object detection is that they are open vocabulary, meaning you don't have to define the classes ahead of time. Additionally fine tuning tends to keep the previous detection capabilities instead of erasing all previous classes like fine tuning YOLO.
It's entirely unsurprising that there are numerous cases that these models can't handle that are "obvious to humans." Machine learning has had this property since its invention and it's a classic mistake humans make dealing with these systems.
Humans assume that because a machine learning model has above human accuracy on task X that it implies that it must also have that ability at all the other tasks. While a human with amazing ability at X would indeed have amazing abilities at other tasks, this is not true of machine learning models. The opposite thinking is also wrong, that because the model can't do well on task Y it must be unreliable and its ability on task X is somehow an illusion and not to be trusted.
Evidently, all these models still fall short.
Promoting those use cases while the models fail in these ways is irresponsible. So, yeah, maybe the models are not embarrassing but the hype definitely is.
It doesn't matter how they are marketed or described or held up to some standard generated by wishful thinking. And it especially doesn't matter what it would mean if a human were to make the same error.
It matters what they are, what they're doing, and how they're doing it. Feel free to be embarrassed if you are claiming they can do what they can't and are maybe even selling them on that basis. But there's nothing embarrassing about their current set of capabilities. They are very good at what they are very good at. Expecting those capabilities to generalize as they would if they were human is like getting embarrassed that your screwdriver can't pound in a nail, when it is ever so good at driving in screws.
No, it's a failure of the companies who are advertising them as capable of doing something which they are not (assisting people with low vision)
"My talking dog always makes mistakes on calculus problems: how embarrassingly bad!"
Has the expectation treadmill really advanced so quickly that sub-human performance on any category of problems is now an embarrassment?
But the model is downstream of the "patch" tokenization, so the cut-down in resolution (compression) of the image has already occurred prior to the point where the model can direct greater "attention".
I think the synthesis is that I'm proposing a per-pixel tokenization with a transformer block whose purpose is to output information at a compression level "equivalent" to that of the patch tokens (is this what an autoencoder is?), but where the attention vector is a function of the full state of the LLM (i.e., inclusive of the text surrounding the image).
Naïvely, I'd think a layer like this that is agnostic to the LLM state needn't be any more computationally costly than the patching computation (both are big honks of linear algebra?), but idk how expensive the "full context attention" feedback is...
(I apologize to anyone who actually understands transformers for my gratuitous (ab|mis)use of terminology)
It's not entirely known how much early learning in low resolution makes a difference in humans, and obviously that could also relate more to our specific neurobiology than a general truth about learning in connectionist systems. But I found it to be an interesting idea that maybe certain outcomes with ANNs could be influenced a lot by training paradigms s.t. not all shortcomings could be addressed with only updates to the core architecture.
That's not related to embodied cognition.
- "Only Sonnet-3.5 can count the squares in a majority of the images", but Sonnet-3, Gemini-1.5 and Sonnet-3.5 all have accuracy of >50%
- "Sonnet-3.5 tends to conservatively answer "No" regardless of the actual distance between the two circles.", but it somehow gets 91% accuracy? That doesn't sound like it tends to answer "No" regardless of distance.
ChatGPT whiffs completely on very obvious images.
Even still, if you ask it for token count it would still fail. My point is that it can’t count; the circuitry required to do so seems absent in these models.
That's quite true. Words mean exactly what people agree upon them meaning. Which does not require everyone, or else slang wouldn't exist. Nor the dictionary, which significantly lags. Regardless, I do not think this is even an unusual use of the word, though I agree the mention of myopia is. The usage makes sense if you consider that both myopic and resolution have more than a singular meaning.
Myopic:
lacking in foresight or __discernment__ : narrow in perspective and without concern for broader implications
Resolution:
the process or capability of making distinguishable the individual parts of an object, closely adjacent optical images, or sources of light
I agree that there are far better ways to communicate. But my main gripe is that they said it was "their hypothesis." If reading the abstract as a whole, I find it an odd conclusion to come to. It doesn't pair with the words that follow with blind guessing (and I am not trying to defend the abstract. It is a bad abstract). But if you read the intro and look at the context of their landing page, I find it quite difficult to come to this conclusion. It is poorly written, but it is still not hard to decode the key concepts the authors are trying to convey.
I feel the need to reiterate that language has 3 key aspects to it: the concept attempted to be conveyed, the words that concept is lossily encoded into, and the lossy decoding of the person interpreting it. Communication doesn't work by you reading/listening to words and looking up those words in a dictionary. Communication is a problem where you use words (context/body language/symbols/etc) to decrease the noise and get the receiver to reasonably decode your intended message. And unfortunately we're in a global world and many different factors, such as culture, greatly affect how one encodes and/or decodes language. It only becomes more important to recognize the fuzziness around language here. Being more strict and leaning into the database view of language only leads to more errors.
> But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.
Because they didn't claim that image size and sharpness were an issue. They claimed the VLM cannot resolve the images "as if" they were blurry. Determining what the VLM actually "sees" is quite challenging. And I'll mention that arguably they did test some factors that relate to blurriness. Which is why I'm willing to overlook the poor analogy.
> I actually lazily tried two of authors' examples in a less performant VLM (CogVLM), and was surprised it passed those
I'm not. Depending on the examples you pulled, 2 random ones passing isn't unlikely given the results.
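To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the two examples are drawn independently and that each passes with probability equal to the model's accuracy on that task. The accuracies below are illustrative placeholders, not figures from the paper, and `prob_all_pass` is just a name picked for this sketch.

```python
def prob_all_pass(p: float, n: int) -> float:
    """Chance that n independently sampled examples all pass,
    given a per-example accuracy of p."""
    return p ** n

# Illustrative accuracies only; these are NOT numbers from the paper.
for p in (0.4, 0.6, 0.8):
    print(f"accuracy {p:.0%}: both of 2 pass with probability {prob_all_pass(p, 2):.0%}")
```

Even at a mediocre per-example accuracy, two passes in a row happen often enough that they say very little on their own.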
Something I generally do not like about these types of papers is that they often do not consider augmentations, since these models tend to be quite sensitive to both the text (prompt) inputs and image inputs. This is quite common in generators in general. Even the way you load in and scale an image can make a significant difference in performance. I've seen simple things like loading an image with numpy, PIL, tensorflow, or torch give significantly different results. But I have to hand it to these authors, they looked at some of this. In the appendix they go through confusion matrices and look at the factors that determine misses. They could have gone deeper and tried other things, but it is a more than reasonable amount of work for a paper.
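As a toy illustration of why scaling matters (not what any particular library or VLM preprocessing pipeline does), here is a pure-Python sketch that downscales an image containing a one-pixel-wide line in two ways. The helper names are made up for this example, and it assumes a square grayscale image whose side is divisible by the scale factor.

```python
def make_image(size: int, line_col: int) -> list[list[int]]:
    """White (255) square image with a one-pixel-wide black (0) vertical line."""
    return [[0 if x == line_col else 255 for x in range(size)] for _ in range(size)]

def downscale_nearest(img: list[list[int]], factor: int) -> list[list[int]]:
    """Keep only the top-left pixel of each factor x factor block."""
    return [row[::factor] for row in img[::factor]]

def downscale_box(img: list[list[int]], factor: int) -> list[list[int]]:
    """Replace each factor x factor block with its (integer) average."""
    n = len(img) // factor
    out = []
    for by in range(n):
        row = []
        for bx in range(n):
            block = [
                img[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

img = make_image(size=8, line_col=3)
print(downscale_nearest(img, 4)[0])  # the line falls between sampled columns
print(downscale_box(img, 4)[0])      # the line survives as a gray smear
```

With nearest-neighbour sampling the thin line disappears entirely, while block averaging keeps a faint trace of it. The same input image can therefore look quite different to the model depending on a preprocessing choice that nobody mentions in the prompt.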
While myopia is common, it's not the best choice of analogy and "blurry vision" is probably clear enough.
Still, I'd only see it as a bad choice of analogy (I can't imagine anyone mistaking optical focus problems for static image processing problems), so per the usual HN recommendation, I'd read their example in the most favourable sense.
Seeing AI 2016 Prototype - A Microsoft research project - https://youtu.be/R2mC-NUAmMk
https://www.seeingai.com are the actual working apps.
I recall showing (pun not intended) the 2016 version to a coworker who had some significant vision impairments, and he was really excited about what it could do back then.
---
I still remain quite impressed with its ability to parse the picture and likely reason behind it https://imgur.com/a/JZBTk2t
Goldman Sachs recently issued a report.
https://www.goldmansachs.com/intelligence/pages/gs-research/...
"We estimate that the AI infrastructure buildout will cost over $1tn in the next several years alone, which includes spending on data centers, utilities, and applications. So, the crucial question is: What $1tn problem will AI solve? Replacing low-wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my thirty years of closely following the tech industry"
Low vision users understand the limitations of accessibility technology better than anyone else. They will VERY quickly figure out what this tech can be used for effectively and what it can't.
Yes, exactly.
--
[0] - Like if you plugged a 3D game engine into the training loop.
[1] - Results of which we could hopefully reuse in training later. And yes, a simulation could itself be a recording of carefully executed experiment in real world.
In the sense that we can't fast-forward our offline training, sure, but humans certainly "go away and think about it" after gaining IRL experience. This process seems to involve both consciously and subconsciously training on this data. People often consciously think about recent experiences, run through imagined scenarios to simulate the outcomes, plan approaches for next time etc. and even if they don't, they'll often perform better at a task after a break than they did at the start of the break. If this process of replaying experiences and simulating variants of them isn't "controlling the flow of (simulated) time" I don't know what else you'd call it.
Isn't this what synthesized embodiment basically always is? As long as the application of the resulting technology is in a restricted, well controlled environment, as is the case for example for an assembly-line robot, this is a great strategy. But I expect fidelity problems will make this technique ultimately a bad idea for anything that's supposed to interact with humans. Like self-driving cars, for example. Unless, again, those self-driving cars are segregated in dedicated lanes.
> a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly
https://arxiv.org/abs/2310.06816
So it's probably 1 sentence == 1 embedding
I agree this research is really interesting, but I didn't have an a priori expectation of what token prediction could accomplish, so my reaction to a lot of the claims and counterclaims of this new tech is that it's good at fooling people and giving plausible but baseless results. It makes for good research but dangerous in the hands of a market attempting to exploit it.
I think people get really wrapped up in the idea that a single model needs to be able to do all the things, and LLMs can do a _lot_, but there doesn't actually need to be a _one model to rule them all_. If VLMs are kind of okay at image interpretation but not great at details, we can supplement them with something that _can_ handle the details.
Counterpoint: the real world is heavily sanitized towards things that don't trip human visual perception up too much, or otherwise inconvenience us. ML models are trained on that, and for that. They're not trained for dealing with synthetic images that couldn't possibly exist in reality and are designed to trip visual processing algorithms up.
Also:
> People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.
Glass half-full (of gasoline) take: those products will trip over real-world problems, identifying them in the process, and the models will get better walking over the corpses of failed AI-get-rich-quick companies. The people involved may not want to know where the problems are, but by deploying the models, they'll reveal those problems to all.
> Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.
That, unfortunately, I 100% agree with. Though AI isn't special here - not giving a fuck kills people regardless of the complexity of software involved.
Neither of these claims is true. ML is highly trained on synthetic images. In fact, synthetic data generation is the way forward for the "scale is all you need" people. And there are also loads of synthetic images out in the wild. Everything from line art to abstract nonsense. Just take a walk downtown near the bars.
> not giving a fuck kills people regardless of the complexity of software involved.
What has me the most frustrated is that this "move fast break things and don't bother cleaning up" attitude is not only common in industry but also in academia. But these two are incredibly intertwined these days and it's hard to publish without support from industry because people only evaluate on benchmarks. And if you're going to hack your benchmarks, you just throw a shit ton of compute at it. Who cares where the metrics fail?
The conclusion and the premise are both true, but not the causality. On AI, the Overton window is mostly filled with people going "this could be very bad if we get it wrong".
Unfortunately, there's enough people who think "unless I do it first" (Musk, IMO) or "it can't possibly be harmful" (LeCun) that it will indeed kill more people than it already has.
The number who are already (and literally) "dead in a ditch" is definitely above zero if you include all the things that used to be AI when I was a kid e.g. "route finding": https://www.cbsnews.com/news/google-sued-negligence-maps-dri...
I expect that as the models grow more complicated so will their reasoning ability.
It really doesn't matter how good current llms are.
They have been good enough to start this era.
And no it's not and never has been just llms. Look what Nvidia is doing with ml.
Whisper: huge advantage. Segment Anything: again huge. AlphaFold 2: again huge.
All the robot announcements -> huge
I doubt we will reach agi just through llms. We will reach agi through multi modal, mix of experts, some kind of feedback loop, etc.
But the stone started to roll.
And you know I prefer to hear about ai advantages for the next 10-30 years. That's a lot better than the crypto shit we had the last 5 years.
For example, in the case of OCR, a person with myopia will usually be able to make out letters and words even without his glasses, based on his expectation (similar to VLM training) of seeing letters and words on, say, a sign. He might not see them all clearly and might make some errors, but he might recognize some letters easily and make up the rest based on context, word recognition, etc. Basically, experience.
I also have a funny anecdote about my partner, who has severe myopia. She once found herself outside her house without her glasses on and saw something on the grass right in front. She told her then brother-in-law, "Look, a squirrel", only for the "squirrel" to take off while shouting its typical caws. It was a crow. This is typical of VLMs' hallucinations.
So unless I missed something, as far as I am concerned, they are optimized for benchmarks.
So while I enjoy gen AI, image-to-text is highly subpar.
At least that's what I would do. Perhaps the script would be a "specialist model" in a sense.
Sometimes if you upload an image to ChatGPT and ask for OCR it will run Python code that executes Tesseract, but that's effectively a bug: GPT-4 vision works much better than that, and it will use GPT-4 vision if you tell it "don't use Python" or similar.
[1] https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...
A modern robot would struggle to fold socks and put them in a drawer, but they're great at making cars.
The 2-4-6 game comes to mind. They may well have verified the AI will work, but it's hard to learn the skill of thinking about how to falsify a belief.
I'm talking about regular people, who actually use these tools for productive use, and can tell the models are up to tasks previously unachievable.
And yet... every interface to every LLM has a "ChatGPT can make mistakes. Check important info." style disclaimer.
The hype around this stuff may be deafening, but it's often not entirely the direct fault of the model vendors themselves, who even put out lengthy papers describing their many flaws.
Humans fuck up all the time.
Looking at the example patterns given:
MATCH
2, 4, 6
8, 10, 12
12, 14, 16
20, 40, 60
NOT MATCH
10, 8, 6
If the answer is "numbers in ascending order", then this is a perfect illustration of synthetic vs. realistic examples. The numbers indeed fit that rule, so in theory, everything is fine. In practice, you'd be an ass to give such examples on a test, because they strongly hint the rule is more complex. Real data from a real process is almost never misleading in this way[0]. In fact, if you sampled such sequences from a real process, you'd be better off assuming the rule is "2k, 2(k+1), 2(k+2)", and treating the last example as some weird outlier.
Might sound like pointless nitpicking, but I think it's something to keep in mind wrt. generative AI models, because the way they're trained makes them biased towards reality and away from synthetic examples.
--
[0] - It could be if you have very, very bad luck with sampling. Like winning a lottery, except the prize sucks.
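To make the two candidate rules concrete, here is a small sketch that checks each one against the examples listed above. The function names (`ascending`, `consecutive_evens`, `misclassified`) are invented for this illustration, and "2k, 2(k+1), 2(k+2)" is read as "three consecutive even numbers".

```python
MATCH = [(2, 4, 6), (8, 10, 12), (12, 14, 16), (20, 40, 60)]
NOT_MATCH = [(10, 8, 6)]

def ascending(t: tuple[int, int, int]) -> bool:
    """Rule: the numbers are in strictly ascending order."""
    a, b, c = t
    return a < b < c

def consecutive_evens(t: tuple[int, int, int]) -> bool:
    """Rule: 2k, 2(k+1), 2(k+2), i.e. three consecutive even numbers."""
    a, b, c = t
    return a % 2 == 0 and b == a + 2 and c == a + 4

def misclassified(rule) -> list[tuple[int, int, int]]:
    """Examples the rule gets wrong: MATCH ones it rejects
    plus NOT MATCH ones it accepts."""
    return [t for t in MATCH if not rule(t)] + [t for t in NOT_MATCH if rule(t)]

for rule in (ascending, consecutive_evens):
    print(rule.__name__, misclassified(rule))
```

The ascending rule explains every example, while the narrower rule only stumbles on (20, 40, 60), which is the "weird outlier" case described above.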
I'd say that every black swan is an example of a real process that is misleading.
But more than that, I mentioned verified/falsified, as in the difference between the two in science. We got a long way with just the first (Karl Popper only died in 1994), but it does seem to make a difference?