AI hype is built on flawed test scores (technologyreview.com)
I suppose in the context of this article “AI” means statistical language models.
Whether ChatGPT has any of those things is questionable. From what I’ve seen, it has an unreliable latent representation, an unreliable understanding of possible moves, and a positional understanding that’s little better than a coin flip.
So, with this definition, these chess engines already exhibit (fairly substantial) reasoning? Or, are you saying it would be required, in the context of an LLM?
It is built on the observation of how fast AI is getting better. If the speed of improvement stays anywhere near the level of the last two years, then over the next two decades it will lead to massive changes in how we work and which skills are valuable.
Just two years ago, I was mesmerized by GPT-3's ability to understand concepts:
https://twitter.com/marekgibney/status/1403414210642649092
Nowadays, using it daily in a productive fashion feels completely normal.
Yesterday, I was annoyed with how cumbersome it is to play long mp3s on my iPad. I asked GPT-4 something like "Write an html page which lets me select an mp3, play it via play/pause buttons and offers me a field to enter a time to jump to". And the result was usable out of the box and is my default mp3 player now.
Two years ago it didn't even dawn on me that this would be my way of writing software in the near future. I have been coding for over 20 years. But for little tools like this, it is faster to ask ChatGPT now.
It's hard to imagine where we will be in 20 years.
I don't think so. When you say "it's not capable of actually reasoning", that's because it's an LLM; and if it "changes in the future", that's because the new system must no longer be a pure LLM. The appearance of reasoning in LLMs is an illusion.
Rather than a binary, it's much more likely that a mix of factors goes into the results, including basic reasoning capabilities developed from the training data (much like the board representations and state-tracking abilities that developed from feeding board game moves into a toy model in Othello-GPT) as well as statistics-driven autocomplete.
In fact, when I've seen GPT-4 get hung up on logic puzzle variations such as transparency, it often seems that the latter is overriding the former. Changing tokens to emoji representations, or having it always repeat the adjectives attached to nouns so that it preserves the variation's context, gets it over the hump to reproducible solutions (as would be expected from a network capable of reasoning), but by default it falls into the pattern of the normative cases.
For something as complex as SotA neural networks, binary sweeping statements seem rather unlikely to actually be representative...
Define reasoning. Because in my definition GPT 4 could reason without doubt. It definitely can't reason better than experts in the field, but it could reason better than say interns.
Your conclusion doesn't follow from your premise.
None of these models are trained to do their best on any kind of test. They're just trained to predict the next word. The fact that they do well at all on tests they haven't seen is miraculous, and demonstrates something very akin to reasoning. Imagine how they might do if you actually trained them or something like them to do well on tests, using something like RL.
If it’s a debate on the illusion of reasoning, I’d be careful how I step here, because it’s been found these things probably work so well because the human brain is also a biological real-time prediction machine and “just” guessing too: https://www.scientificamerican.com/article/the-brain-guesses...
This language embodies the anthropomorphic assumptions that the author is attacking.
We are in a Cambrian Explosion on the software side and hardware hasn’t yet reacted to it. There’s a few years of mad discovery in front of us.
People have different impressions as to the shape of the curve that’s going up and right, but only a fool would not stop and carefully take in what is happening.
Making your own "internal family system" of AIs is making this exponential (and frightening), like an ensemble on top of the ensemble, with specific "mindsets" that, with shared memory, can build and do stuff continuously. Found this from a comp sci professor on TikTok, so be warned: https://www.tiktok.com/@lizthedeveloper/video/72835773820264...
I remember a couple of comments here on HN when the hype began about how some dude thought he had figured out how to actually make an AGI - can't find it now, but it was something about having multiple AIs with different personalities discoursing with a shared memory - and now it seems to be happening.
Couple this with access to Linux containers that can be spawned on demand, and we are in for a wild ride!
That's a big assumption to make. You can't assume that the rate of improvement will stay the same, especially over a period of 2 decades, which is a very long time. Every advance in technology hits diminishing returns at some point.
Technological progress seems to be accelerating rather than diminishing to me.
Computers are a great example: They have been getting more capable exponentially over the last decades.
In terms of performance (memory, speed, bandwidth) and in terms of impact. First we had calculators, then we had desktop applications, then the Internet and now we have AI.
And AI will help us get to the next stage even faster.
I simply don't see it as being the same today. The obvious element of scaling, or of techniques that imply a useful overlap, isn't there. Before, researchers brought together excellent and groundbreaking performance on different benchmarks and areas as they worked on GPT-3; since 2020, except for instruction following, less has been predictable.
Multimodal could change everything (things like the ScienceQA paper suggest so), but it also might not shift benchmarks. It's just not so clear that the future is as predictable as, or will be faster than, the last few years. I do have my own beliefs, similar to Yann LeCun's, about what architecture or rather infrastructure makes most sense intuitively going forward, and there's not really the openness we used to have from top labs to know whether they are going these ways or not. So you are absolutely right that it's hard to imagine where we will be in 20 years, but in a strange way: because it is much less clear than it was in 2020 where we will be from 3 years' time onwards, I would say progress is much less guaranteed than many feel it is...
The way that LLMs and humans "think" is inherently different. Giving an LLM a test designed for humans is akin to giving a camera a 'drawing test.'
A camera can make a better narrow final output than a human, but it cannot do the subordinate tasks that a human illustrator could, like changing shadings, line width, etc.
An LLM can answer really well on tests, but it often fails at subordinate tasks like 'applying symbolic reasoning to unfamiliar situations.'
Eventually the thinking styles may converge in a way that makes the LLMs practically more capable than humans on those subordinate tasks, but we are not there yet.
AI is getting subjectively better, and we need better tests to figure out if this improvement is objectively significant or not.
OpenAI is reportedly losing 4 cents per query. With a thousandfold increase in model size, and assuming linear scale in cost, that's a problem. Training time is going to go up too. Moore's law isn't going to help any more. Algorithmic improvements may help...if any significant ones can be found.
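To make that arithmetic concrete, here is a minimal sketch, assuming (as the comment does) that per-query cost scales linearly with model size. Since a loss can never exceed the cost, the reported 4 cents is treated only as a lower bound on today's per-query cost. The function name and parameters are hypothetical, and none of this comes from published pricing data.

```python
def scaled_cost_floor(loss_per_query_usd: float, size_multiplier: float) -> float:
    """Lower bound on per-query cost after scaling the model.

    Assumptions (illustrative only):
    - per-query cost grows linearly with model size
    - cost >= loss, so the reported loss is a floor on today's cost
    """
    if loss_per_query_usd < 0 or size_multiplier <= 0:
        raise ValueError("loss must be non-negative and multiplier positive")
    return loss_per_query_usd * size_multiplier


# Figures from the comment: 4 cents lost per query, a 1000x larger model.
floor = scaled_cost_floor(0.04, 1000)
print(f"At least ${floor:.2f} per query")
```

Under those assumptions the floor works out to about $40 per query, which is why linear scaling alone looks untenable without hardware or algorithmic gains.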
Training a model on more data improves generalization not memorization.
To store more information in the same number of parameters requires the commonality between examples to be encoded.
In contrast, training on less data, especially if it is repeated, lets the network learn to provide good answers for that limited set without generalizing, i.e. memorizing.
——
It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.
The fewer examples, the more likely they just pattern match.
If it had perfect recall I would be so thrilled.
And just because it's memorized the data--as all intelligences would need to do to spit data out--doesn't mean it can't still do useful operations on the data, or explain it in different words, or whatever a human might do with it.
https://chat.openai.com/share/29d695e6-7f23-4f03-b2be-29b7c9...
Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the ai? What I mean, is that each subsequent prompt here either generated a new whole output, or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output and the AI further modifies the existing output until it’s refined to the degree I want it.
Or is that effectively what Copilot/cursor do and I’m just a bad operator?
So you were ignorant two years ago; GitHub Copilot was already available to users back then. The only big new thing in the past two years was GPT-4, and nothing suggests anything similar will come in the next two years. There are no big new things on the horizon. We knew for quite a while that GPT-4 was coming, but there isn't anything like that this time.
But when Copilot came out, I was indeed ignorant! I remember when a friend showed it to me for the first time. I was like "Yeah, it outputs almost correct boilerplate code for you. But thankfully my coding is such that I don't have to write boilerplate". I didn't expect it to be able to write fully functional tools and understand them well enough to actually write pretty nice code!
Regarding "there isn't anything like that this time": quite the opposite! We have not figured out where using larger models and throwing more data at them will level off! This could go on for quite a while. With FSD 12, Tesla is already testing self driving with a single large neural net, without any glue code. I am super curious how that will turn out.
The whole thing is just starting.
Other breakthroughs in graph machine learning https://towardsdatascience.com/graph-ml-in-2023-the-state-of...
At one point, they showed some old footage which featured a montage of daily life in a small Mississippi town. You'd see people shopping for groceries, going on walks, etc. Some would stop and wave at the camera.
In the documentary, they noted that this footage exists because at the time, they'd show it on screen during intermission at movie theaters. Film was still in its infancy at that time, and was so novel that people loved seeing themselves and other people on the big screen. It was an interesting use of a new technology, and today it's easy to understand why it died out. Of course, it likely wasn't obvious at the time.
I say all that because I don't think we can know at this point what AI is capable of, and how we want to use it, but we should expect to see lots of failure while we figure it out. Over the next decade there's undoubtedly going to be countless ventures similar to the "show the townspeople on the movie screen" idea, blinded by the novelty of technological change. But failed ventures have no relevance to the overall impact or worth of the technology itself.
I think it's probably more sociological than technical. People love to see themselves and their friends/family. My work has screens that show photos of events and it always causes a bit of a stir ("Did you see X's photo from the summer picnic?") Yearbooks are perennially popular and there's a whole slew of social media.
However, for this to be "fun", there must be a decent chance that most people in the audience know a few people in a few of the pictures. I can't imagine this working well in a big city, for example, or a rural theatre that draws from a huge area.
The custom of showing film consisting of footage of the general public in movie theaters.
LLMs seem to use little or no abstract reasoning (is-a) or hierarchical perception (has-a), as humans do -- both of which are grounded in semantic abstraction. Instead, LLMs can memorize a brute force explosion in finite state machines (interconnected with Word2Vec-like associations) and then traverse those machines and associations as some kind of mashup, akin to a coherent abstract concept. Then as LLMs get bigger and bigger, they just memorize more and more mashup clusters of FSMs augmented with associations.
Of course, that's not how a human learns, or reasons. It seems likely that synthetic cognition of this kind will fail to enable various kinds of reasoning that humans perceive as essential and normal (like common sense based on abstraction, or physically-grounded perception, or goal-based or counterfactual reasoning, much less insight into the thought processes / perceptions of other sentient beings). Even as ever-larger LLMs "know more" by memorizing ever more FSMs, I suspect they'll continue to surprise us with persistent cognitive and perceptual deficits that would never arise in organic beings that do use abstract reasoning and physically grounded perception.
That's actually the closest to a working definition of what a concept is. The discussion about language representation has little bearing on humans or intelligence, because it's not how we learn and use language. Similarly, the more people - be it armchair or diploma-carrying philosophers - try to find the essence of a meaning of some word, the more they fail, because it seems that meaning of any concept is defined entirely through associations with other concepts and some remembered experiences. Which again seems pretty similar to how LLMs encode information through associations in high-dimensional spaces.
It _is_ correct to say that an LLM is not ready to be a medical doctor, even if it can pass the test.
But I think a better conclusion is that test scores don’t help us understand LLM capabilities like we think they do.
Using a human test for an LLM is like measuring a car’s “muscles” and calling it horsepower. They’re just different.
But the AI hype is justified, even if we struggle to measure it.
That's why I'm hyped. If it's that good for me, and it's generalizable, then it's going to rock the world.
I am currently transliterating a language PDF into a formatted lexicon, I wouldn't even be able to do this without co-pilot, it has turned this seemingly impossibly arduous task into a pleasurable one.
One is just the wow factor. It will be short lived. A bit like VR, which is awesome when you first try it, but it wears out quickly. Here, you can have a bot write convincing stories and generate nice looking images, which is awesome until you notice that the story doesn't make sense and that the images have many details wrong. This is not just a score, it is something you can see and experience.
And there is also the real thing. People have started using GPT for real work. I have used it to document my code, for instance, and it works really well: with it I can do a better job than without, and I can do it faster. Many students use it to do their homework, which may not be something you want, but it is no less of a real use. Many artists are strongly protesting against generative AI, which in itself is telling: it means it is taken seriously. At the same time, other artists are making use of it.
It is even used to great effect where you don't notice. Phone cameras are a good example: by enhancing details using AI, they give you much better pictures than what the optics are capable of. Some people don't like that because the pictures are "not real", but most enjoy the better perceived quality. Then there are image classifiers, speech-to-text and OCR, fuzzy searching, the content ranking algorithms we love to hate, etc., which all make use of AI.
Note: here AI = machine learning with neural networks, which is what the hype is about. AI is a vague term that can mean just about anything.
He is of the opinion that the current-generation transformer architecture is flawed and that it will take a new generation of models to get close to the hype.
The tests are used (and, despite their flaws, useful) to compare various facets of model A to model B - however, the validation whether a model is good now comes from users, and that validation really can't be flawed much - if it's helpful (or not) to someone, then it is what it is, the proof of the pudding is in the eating.
> But when a large language model scores well on such tests, it is not clear at all what has been measured. Is it evidence of actual understanding? A mindless statistical trick? Rote repetition?
It is measuring how well it does _at REPLACING HUMANS_. It is hard to believe how the author clearly does not understand this. I don't care how it obtains its results.
GPT-4 is like a hyperspeed entry to mid level dev that has almost no ability to contextualize. Tools built on top of 32k will allow repo ingestion.
This is the worst it will ever be.
It's possible to do well on a test and have no ability to do the thing the job tests for.
GPT-4 scores well on an advanced sommelier exam, but obviously cannot replace a human sommelier, because it does not have a mouth.
Also an aside:
> This is the worst it will ever be.
I hear this a lot and it really bothers me. Just because something is the worst it’ll ever be doesn’t mean it’ll get much better. There could always be a plateau on the horizon.
It’s akin to “just have faith.” A real weird sentiment that I didn’t notice in tech before 2021.
Lots of things usefully correlate with test scores in humans but might not in an AI.
Whenever there's a news item or article noting the limits of current LLM tech (especially the GPT class of models from OpenAI), there's always a comment that says something along the lines of "ah did you test it on GPT-4"?
Or if it's clear that it's the limitation of GPT-4, then you have comments along the lines of "what's the prompt?", or "the prompt is poor". Usually, it's someone who hasn't in the past indicated that they understand that prompt engineering is model specific, and the papers' point is to make a more general claim as opposed to a claim on one model.
Can anyone explain this? It's like the mere mention of LLMs being limited in X, Y, Z fashion offends their lifestyle/core beliefs. Or perhaps it's a weird form of astroturfing. To which, I ask, to what end?
I agree there have been many attention-grabbing headlines that are due to simple issues like contamination. However, I think AI has already proved its business value far beyond those issues, as anyone using ChatGPT with a code base not present in its dataset can attest.
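For anyone unsure what "contamination" means in practice: one crude check is to look for word n-grams from a benchmark item that also appear verbatim in the training corpus. Below is a minimal sketch, assuming whitespace tokenization and a hypothetical in-memory list of corpus documents. The function names and the choice of `n` are illustrative, not any lab's actual procedure.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(test_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that appear verbatim in the corpus.

    A high ratio hints that the item may have been seen during training;
    a low ratio does not prove it was unseen (paraphrases slip through).
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy usage with a small n so the overlap is easy to follow by hand.
ratio = contamination_ratio("a b c d e", ["x a b c d y"], n=3)
print(f"overlap ratio: {ratio:.2f}")
```

Real contamination audits are more involved (normalization, fuzzy matching, corpora far too big for memory), so treat this purely as an illustration of the idea.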
It seems pretty important to counter that and to debunk any wild claims such as these. To provide context and to educate the world on their shortcomings.
The article is actually fine and pretty balanced, but it is a bit unfortunate that 80% of their examples are not illustrative of current capabilities. At least for me, most of my optimism about the utility of LLM's comes from GPT-4 specifically.
I find the whole hype & anti-hype dynamic so tiresome. Some are over-hyping, others are responding with over-anti-hyping. Somewhere in-between are many reasonable, moderate and caveated opinions, but neither the hypesters nor the anti-hypesters will listen to these (considering all of them to come from people at the opposite extreme), nor will outside commentators (somehow being unable to categorize things as anything more complicated than this binary).
There is a possible world where AI will be a truly transformative technology in ways we can't possibly understand.
There is a possible world where this tech fizzles out.
So one of the reasons that there is a broad 'hype' dynamic here is because the range of possibilities is broad.
I sit firmly in the first camp though - I believe it's truly a transformative technology, and struggle to see the perspective of the 'anti-hype' crowd.
There are millions of hustlers out there pushing snake oil. The probability that something is the real deal and not snake oil is small. Better to assume the glass is half empty.
I'm sure that is just a matter of prompt engineering, though.
No, it's built on people using DALLE and Midjourney and ChatGPT.
‘Pre-training on the Test Set Is All You Need‘
GPT-4 is really good at digging up information it has seen before, but please don’t use it for any serious reasoning. Always take the answer with a grain of salt.
You start with everyone knows there's AI hype from tech bros. Then you introduce a PhD or two at institutions with good names. Then they start grumbling about anthropomorphizing and who knows what AI is anyway.
Somehow, if it's long enough, you forget that this kind of has nothing to do with anything. There is no argument. Just imagining other people must believe crazy things and working backwards from there to find something to critique.
Took me a bit to realize it's not even an argument, just parroting "it's a stochastic parrot!" Assumes other people are dunces and genuinely believe it's a minihuman. I can't believe MIT Tech Review is going for this, the only argument here is the tests are flawed if you think they're supposed to show the AI model is literally human.
The hype is based entirely on the fact that I can talk (in text) to a machine and it responds like a human. It might sometimes make up stuff, but so do humans. I therefore don't consider that a significant downside, or problem. In the end chatgpt is still ... a baby.
The hype builds around the fact that I can run a language model that fits into my graphics cards and responds at faster-than-typing speed, which is sufficient.
The hype builds around the fact that it can create and govern whole text based games for me, if I just properly ask it to do so.
The hype builds around the fact that I can have this everywhere with me, all day long, whenever I want. It never grows tired, it never stops answering, it never scoffs at me, it never hates me, it never tells me that I'm stupid, it never tells me that I'm not capable of doing something.
It always teaches me, always offers me more to learn, it always is willingly helping me, it never intentionally tries to hide the fact that it doesn't know something and never intentionally tries to impress me just to get something from me.
Can it get things wrong? Sure! Happens! Happens to everyone. Me, you, your neighbour, parents, teachers, plumbers.
Not a single minute did I, or dozens of millions of others, give a single flying fuck about test scores.
I wasn't sure that the phenomena they discussed were as relevant to the question of whether AI is overhyped as they made them out to be, but I did think a lot of the questions about the meaning of the performances were important.
What's interesting to me is you could flip this all on its head and, instead of asking "what can we infer about the machine processes these test scores are measuring?", we could ask "what does this imply about the human processes these test scores are measuring?"
A lot of these tests are well-validated but overinterpreted, I think, and leaned on too heavily to make inferences about people. If a machine can pass a test, for instance, what does it say about the test as used in people? Should we be putting as much weight on them as we do?
I'm not arguing these tests are useless or something, just that maybe we read into them too much to begin with.
"25% of the potential target audience dislikes AI and do not have their opinion positively represented in the media they consume. The potential is unsaturated. Maximum saturation estimated at 15 articles per week."
A bit more serious: AI hasn't even scratched the surface. Once we apply LLMs to speech synth and improve the visual generators by just a tiny bit, to fix faces, we can basically tell the AI to "create the best romantic comedy ever made".
"Oh, and repeat 1000 times, please".
The ones who have to dismantle the hype are the proper technologists such as Yann LeCun and Grady Booch who know exactly what they are talking about.
“People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI,” says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. “The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human.”
The last sentence above is an important point that most people don't consider. It's not an apples/apples comparison. The nature of the capability profile of a human vs. any known machine is radically different. Machines are intentionally designed to have extreme peaks of performance in narrow areas. Present-generation AI might be wider in its capabilities than what we've previously built, but it's still rather narrow as you quickly discover if you start trying to use it on real tasks.
I agree completely with you on this.
In defence of the executives, however, some businesses will be seriously affected. Call centres and plagiarism scanners have already been affected, but it’s unclear which other industries will be affected too. Maybe the probability is low, but the impact could be very high. I think this reasoning is driving the executives.
I’ve said it before, but as someone to whom “AI” means something more than making API calls to some SAAS, I look forward to the day they hire me at $300/hour to replace their “AI strategy” with something that can be run locally off of a consumer-grade GPU or cheaper.
It started a few years back and it is now really inflamed with LLMs, because of the consumer-level hype and general media reporting about it.
You can see that in the multiple AI startups capturing millions in venture capital for absolutely bogus value propositions. Bizarre!
While I agree with you in general, I don't think this bit is particularly fair. I'd say we know the limitations, and we also know that using LLMs might bring some advantage, and the companies that are able to use it properly will have a better position, so it makes sense to at least investigate the options.
This only appears so because we here have some insight into the domain. But there have always been hype cycles. We just didn't notice them so readily.
The speed with which this happens makes me suspect there is a hidden "generic hype army" that was already in place, presumably hyping the last thing, and ready to jump on this thing.
In capitalism, you grow or you die, and sometimes you need to bullshit people about growth potential to buy yourself time.
They put the test scores front and center in the initial announcement, with a huge image showing improvements on AP exams. It was the main thing people talked about during the announcement and the first thing anyone who reads anything about GPT-4 sees.
I don't think many who are hyped about these things missed that.
I seriously don't remember hearing these test results being mentioned in any casual conversation, and I heard a lot of casual conversations about AI. The majority of these center around personal experiences ("I asked ChatGPT this and I got that..."), homework is another common topic. When we compare systems, we won't say "this one got a 72 and the other got a 94", but more like "I asked new system to give me a specific piece of code (or cocktail recipe, or anything) and the result is much better". Again, personal experience and anecdotes before scores.
Maybe people in the field hype themselves with scores, but not the general public, and probably not the investors either, who will most likely look at the financial performance of the likes of OpenAI instead.
Perhaps because whenever there's "a news or article noting the limits of current LLM tech", it's a bit like someone tried to play a modern game on a machine they found in their parents' basement, and the only appropriate response to this is, "have you tried running it on something other than a potato"? This has been happening so often over the past few months that it's the first red flag you check for.
GPT-4 is still qualitatively ahead of all other LLMs, so outside of articles addressing specialized aspects of different model families, the claims are invalid unless they were tested on GPT-4.
(Half the time the problem is that the author used ChatGPT web app and did not even realize there are two models and they've been using the toy one.)
Expect the model to continue to perform like it does today, with lots of dumb integrations added to it, and you will get a very accurate prediction of how most new tech hype turns out. Dumb integrations can't add intelligence, but they can add a lot of value, so the rational hype still sees this as a very valuable and exciting thing, but it isn't a complete revolution in its current form.
So my perception is this leads to people who have good luck and perceive LLMs as near AGI because it arrives at a useful answer more often than not, and these people cannot believe there are others who have bad luck and get worthless output from their LLM, like someone at a roulette table exhorting "have you tried betting it all on black? worked for me!"
2. LLMs, in spite of the complaints about the research leaders, are fairly democratic. I have access to several of the best LLMs currently in existence and the ones I can't access haven't been polished for general usage anyway. If you make a claim with a prompt, it's easy for me to verify it.
3. I've been linked legitimate ChatGPT prompts where someone gets incorrect data from ChatGPT - my instinct is to help them refine their prompt to get correct data
4. If you make a claim about these cool new tools (not making a claim about what they're good for!) all of these kick in. I want to verify, refine, etc.
Of course some people are on the bandwagon, and it is akin to insulting their religion (it is with religious fervor that they hold their beliefs!), but at least most folks on HN are just excited and trying to engage.
^^ I actually think making this claim is in bad form generally. It's like looking for the existence of aliens on a planet. Absence of evidence is not evidence of absence.
If you are trying to make categorical statements about what AI is unable to do, at the very least you should use a state-of-the-art system, which conveniently is easily available for everyone.
It's a weird thing to get hung up on if you ask me.
Yes, but the models we're talking about have been trained specifically on the task of "complete arbitrary textual input in a way that makes sense to humans", for arbitrary textual input, and then further tuned for "complete it as if you were a person having conversation with a human", again for arbitrary text input - and trained until they could do so convincingly.
(Or, you could say that with instruct fine-tuning, they were further trained to behave as if they were an AI chatbot - the kind of AI people know from sci-fi. Fake it 'till you make it, via backpropagation.)
In short, they've been trained on an open-ended, general task of communicating with humans using plain text. That's very different to typical ML models which are tasked to predict some very specific data in a specialized domain. It's like comparing a Python interpreter to Notepad - both are just regular software, but there's a meaningful difference in capabilities.
As for seeing glimpses of understanding in SOTA LLMs - this makes sense under the compression argument: understanding is lossy compression of observations, and this is what the training process is trying to force to happen, squeezing more and more knowledge into a fixed set of model weights.
Why I think AI is not the appropriate term is that if it were AI, the AI would have already figured everything out for us (or for itself). An LLM can only chain text; it does not really understand the content of the text, and it can't come up with novel solutions (or if it accidentally does, it's due to hallucination). This can be easily confirmed by giving current LLMs some simple puzzles, math problems and so on. Image models have similar issues.
https://en.wikipedia.org/wiki/AI_effect
Just because you don't like how poorly the term AI is defined, doesn't mean it is the wrong term.
AI can never be well defined because the word intelligence itself is not well defined.
Haven’t had it make anything usable that’s more complicated than a mad lib yet
ChatGPT does this.
Copilot, at least from what little I did in vscode, isn't as powerful as this. I think there's a GPT4 mode for it that I haven't played with that'd be a lot closer to this.
Some businesses in some industries can follow a strategy of "never do anything until it's a well established process", others cannot.
“The length of a film should be directly related to the endurance of the human bladder.” - Alfred Hitchcock
Trying to extrapolate actual progress is bad in itself, but trying to extrapolate your perceived progress is even worse.
More likely this will become the new “search” technology and get polluted with ads. People will lose trust and it will decay.
For the specific example the OP cited "War: like being eaten by a dragon and then having it spit you out"
then unless its answer to "Where were you in between being eaten by a dragon and before it spat you out?" is "in the dragon's digestive system" that isn't understanding.
And I'm curious to see it answer "Dragons only exist in mythology; does your analogy mean war doesn't exist either? Why not compare to an animal that exists?"
This exact text, and the response (several attempts) is flagged and censored, with ChatGPT-4 web interface. :-|
Sure, I can compare it to a real animal such as bears. Just like dragons are depicted as fearsome and destructive creatures in mythology, bears are also portrayed as powerful and dangerous animals in many cultures. However, just as the existence of dragons is a matter of fiction, the concept of a "war on bears" would also be nonsensical.
In reality, the issue of war is complex and multifaceted, involving political, economic, and societal factors. It cannot be reduced to a simplistic battle between humans and a particular animal species.
Furthermore, the use of animals in analogies can be problematic as it may trivialize the gravity and complexity of real-world issues. Wars and conflicts have serious consequences for human lives, whereas animals do not possess the same agency or ability to wage war.
In conclusion, while dragons may serve as a metaphor for war in mythology, comparing war to a fictional creature diminishes the seriousness of the topic and fails to capture the complexities of real conflicts.
It would demonstrate basic reasoning skills that weren't things one would "reproduce, mix and match text from an enormous corpus". Like the response you provided, which is meaningless word salad. It's a prima facie takedown of your post.
This is like people who hate poetry insisting their bad poetry is good poetry. Why? Because who else is to say otherwise! Well, the good poets. The people who appreciate poetry will know the difference. Everyone else won't care, save for those invested in having to sell their bad poetry as good.
Personally I think it's priced perfectly - it's a really good typing assistant for obvious code, and helps me stay in flow longer.
In fact I'd pay double for a version with half the latency.
Can you measure the bottom-line impact of using CI/CD, IDEs, static code analysis, source control, or whatever tool? If you don't know the exact numbers and are just guesstimating, are you actually accounting for the costs or just moaning because you don't like the tool? Who even works with exact ROI numbers for these kinds of decisions? I can't think of a scenario where accurately determining the ROI of any one thing is possible and it doesn't reduce to gut checks. Pretending it can be measured sounds as naive as people trying to measure developer productivity with fixed metrics.
The cost of Copilot is so low that it falls under discretionary spending - it would take more time to figure out the actual value than to pay for the people who want it. People have already figured out that it's better to just allocate a budget to individuals, let them decide which tools work for them, and go through the purchase requisition and approval dance only for big-ticket/external-dependency items where the impact is worth the time spent on making the decision.
The invention of the PC market was filled with hustlers but that doesn't mean that the PC didn't match the hype.
The .com boom was filled with hustlers, but that doesn't mean that the Internet wasn't transformative.
Actual real world results... well the technology is already responsible for c40% of code on Github. Image recognition technologies are soaring and self-driving feels within reach. Few people doubt that a real-world Jarvis will be in your home within 12 months. The Turing test is smashed, and LLMs are already replacing live chat operatives. And this is just the start of the technology...
But a lot of .com projects were BS. If you were to pick at random, the probability you got a winner is low. Thus it’s wise to be skeptical of all hyped stuff until they have proven themselves.
> Actual real world results... well the technology is already responsible for c40% of code on Github.
Quite sure you misread that article. It says 40% of the code checked in by people who use Copilot is AI-generated. Not 40% of all code.
That’s how some programmers are, I guess. I have heard of people copy-pasting code directly from Stack Overflow without a second thought about how it works. That’s probably Copilot’s audience.
Are we really saying that people who were saying the internet was a transformative technology in the mid-1990s were wrong? It was transformative, but it was hard to see which parts of the technology would stick around. Of course it doesn't mean that every single company and investment was going to be profitable; that's not true of anything, ever. People investing in Amazon and Google were winners though - these are companies that have in many ways reinvented the markets they operate in.
> Quite sure you misread that article. It says 40% of the code checked in by people who use Copilot is AI-generated. Not 40% of all code.
Ok, I'll take that it's 40% of Copilot users. That's still 40% of some programmers' code!
How do you know GPT-4 wasn't trained to do well on these tests? They didn't disclose what they did for it, so you can't say it wasn't trained to do well on these tests. That could be the magic sauce for it.
That is the learning algorithm.
The algorithm they learn, in response, is quite different, since that learned algorithm is based on the training data.
In this case the models learn to sensibly continue text or conversations. And they are doing it so well it’s clear they have learned to “reason” at an astonishing level.
Sometimes, not as good as a human.
But in a tremendous number of ways they are better.
Try writing an essay about the many-worlds interpretation of the quantum field equation, from the perspective of Schrödinger, with references to his personal experiences, using analogies with medical situations, formatted as a brief for the Supreme Court, in Dr. Seuss prose, in a random human language of choice.
In real time.
While these models have some trouble with long chains of reasoning, and with reasoning about things they have no experience of (different modalities, although sometimes they are surprisingly good), it is clear that they can also reason by combining complex information drawn from their whole knowledge base, much faster and more sensibly than any human has ever come close to.
Where they exceed us, they trounce us.
And where they don’t, it’s amazing how fast they are improving. Especially given that year to year, biological human capabilities are at a relative standstill.
——
EDIT: I just tried the above test. The result was wonderful, whimsical prose and references that made sense at a very basic level, and that a Supreme Court of 8-year-olds would likely enjoy, especially if served along with some Dr. Seuss art! In about 10-15 seconds.
Viewed as a solution to an extremely complex constraint problem, that is simply amazing. And far beyond human capabilities on this dimension.
A strong hint to what they focused on in their training process is what metrics they used in their marketing of the model. You should always bet on models being optimized to perform on whatever metrics they themselves give you when they market the model. Look at the gpt-4 announcement, what metrics did they market? So what metrics should we expect they optimized the model for?
Exam results are the first metric they mention, so exams were probably one of their top priorities when they trained gpt-4.
Haven't they seen these tests?
We know little to nothing of how these models get trained.
> The fewer examples, the more likely they just pattern match.
A kid who uses a calculator and just fills in the answer to every question will see a lot more examples than a kid that learned by starting from simple concepts and understanding each step. But the kid who focused on learning concepts and saw way fewer problems will obviously have a better understanding here.
So no, you are clearly wrong here; humans don't learn that way at all. These models learn that way, you are right on that, but humans don't.
In neither case did I introduce one.
And since the calculator itself already has a general understanding, it would seem completely counterproductive to start training a computer or child by first giving them a machine that has already solved the problem.
Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.
To make a human understand, we need to explain how things work to them. You don't just show examples. A human who is just shown a lot of examples won't understand much at all, even if he tries to replicate them.
> Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.
What does this have to do with how humans learn?
Yes, many people reason based on pure pattern-matching and repeat opinions not because they've reasoned them but because they're what they've absorbed from other sources, but even the world's most unreasoned human being with at least functional cognition still uses an enormous amount of constant, daily, hourly self-directed decision-making for a vast variety of complex and simple, often completely spontaneous scenarios and tasks in ways that no machine we've yet built on Earth does or could.
Moreover, even when some humans say or "believe" things based on nothing more than what they've absorbed from others without really considering it in depth, they almost always do so in a particularly selective way that fits their cognitive, emotional and personal predispositions. This very selectiveness is a distinctly conscious trait of a self-aware being. It's something LLMs don't have, as far as I've yet seen.
I guess what I'm really asking is: what would you expect to observe to make it not illusory?
Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning? You could argue that bullshit is fallacious reasoning, "pseudo-reasoning" based on incorrect rules of inference.
But these models don't use any rules of inference; they produce output that resembles the result of reasoning, but without reasoning. They are trained on text samples that presumably usually are the result of human reasoning. If you trained them on bullshit, they'd produce output that resembled fallacious reasoning.
No, I don't think the touchstone for actual reasoning is a human mind. There are machines that do authentic reasoning (e.g. expert systems), but LLMs are not such machines.
These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.
Unfortunately, this means the "reasoning" exhibited by language models is limited: if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.
That said, I do think adding reasoning capabilities is an active area of research, but we don't have a clear time horizon on when that might happen. Current prompting approaches are stopgaps until research identifies a promising approach for developing reasoning, e.g. combining latent space representations with planning algorithms over knowledge bases, constraining the logits based on an external knowledge verifier, etc (these are just random ideas, not saying they are what people are working on, rather are examples of possible approaches to the problem).
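To make one of those random ideas concrete, here is a minimal sketch of "constraining the logits based on an external knowledge verifier". Everything in it is an assumption made up for illustration: the three-word vocabulary, the scores, the `knowledge_base` dict and the `verifier` function are hypothetical and don't come from any real model or library. It only shows the masking step: candidates the verifier rejects get a logit of negative infinity before the softmax, so they end up with zero probability.

```python
import math

def constrained_softmax(logits, allowed):
    """Softmax over `logits`; tokens whose `allowed` flag is False get probability 0."""
    if not any(allowed):
        raise ValueError("verifier rejected every candidate token")
    # Rejected tokens are pushed to -inf so exp() maps them to exactly 0.0.
    masked = [x if ok else float("-inf") for x, ok in zip(logits, allowed)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and raw scores (not from a real model).
vocab = ["Paris", "London", "Berlin"]
logits = [2.0, 2.5, 0.1]  # unconstrained, "London" would win

# Hypothetical external verifier: a lookup in a tiny knowledge base.
knowledge_base = {"capital of France": {"Paris"}}

def verifier(token, query="capital of France"):
    return token in knowledge_base[query]

probs = constrained_softmax(logits, [verifier(t) for t in vocab])
best = vocab[probs.index(max(probs))]
print(best, probs)
```

The hard part is obviously the verifier, not the masking. This says nothing about where a trustworthy verifier would come from.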
In my opinion, language models have been good enough since the GPT-2 era, but have been held back by a lack of reasoning and efficient memory. Making the language models larger and trained on more data helps make them more useful by incorporating more facts with increased computational capacity, but the approach is fundamentally a dead end for higher level reasoning capability.
I'm curious where you are drawing your definition or scope for 'reasoning' from?
For example, in Shuren The Neurology of Reasoning (2002) the definition selected was "the ability to draw conclusions from given information."
While I agree that LLMs can only process token to token and that juggling context is critical to effective operation such that CoT or ToT approaches are necessary to maximize the ability to synthesize conclusions, I'm not quite sure what the definition of reasoning you have in mind is such that these capabilities fall outside of it.
The typical lay audience suggestion that LLMs cannot generate new information or perspectives outside of the training data isn't the case, as I'm sure you're aware, and synthesizing new or original conclusions from input is very much within their capabilities.
Yes, this has to happen within a context window and occurs on a token by token basis, but that seems like a somewhat arbitrary distinction. Humans are unquestionably better at memory access and running multiple subprocesses on information than an LLM.
But if anything, this simply suggests that continuing to move in the direction of multiple pass processing of NLP tasks with selective contexts and a variety of fine tuned specializations of intermediate processing is where practical short term gains might lie.
As for the issue of new domains outside of training data, I'm somewhat surprised by your perspective. Hasn't one of the big research trends over the past twelve months been that in-context learning has proven more capable than was previously expected? I'd agree that a zero-shot evaluation of a problem type that isn't represented in an LLM's training data is setting it up for failure, but the capacity to extend in-context examples outside of training data has proven relatively more successful, no?
Is it not possible that this is essentially how our brains do it too? Attempt to plan by branching out to related ideas until they contain an answer. Any of these statements that AI can't be on track to reason like a human because of X seem to come with an implication that we have such a good model of the human brain that we know it doesn't X. But I'm not an expert on neuroscience so in many of these cases maybe that implication is true.
Is that how you think? Just curious
True. But look at the Phi-1.5 model - it punches 5x above its weight limit. The trick is in the dataset:
> Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6B tokens of filtered code dataset used in phi-1’s training (see [GZA+ 23]).
> We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.
https://arxiv.org/pdf/2309.05463.pdf
Synthetic data has its advantages - less bias, more diversity, scalability, higher average quality. But more importantly, it can cover all the permutations and combinations of skills, concepts, and situations. That's why a small model of just 1.3B parameters like Phi-1.5 was able to work like a 7B model. Usually at that scale they are not coherent.
How would you describe the behavior of "GPT Advanced Data Analysis"?
By the relative mix of training data, additional fine tuning training phases, and/or pre-prompts that give the model extra guidance relative to particular task types.
None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.
Fallacious reasoning will make you wrong. No reasoning will make you spew nonsense. Truth and lies and bullshit, all require reasoning for the structure of what you're saying to make sense, otherwise it devolves to nonsense.
> But these models don't use any rules of inference
Neither do we. Rules of inference came from observation. Formal reasoning is a tool we can employ to do better, but it's not what we naturally do.
Maybe splitting hairs, but I’d argue that the bullshitter is reasoning about what sounds good, and what sounds good needs at least some shared assumptions and resulting logical conclusion to hang its hat on. Maybe not always, but enough of the time that I would still consider reasoning to be a key component of effective bullshit.
A desert mirage in the distance is an illusion; to the observer, it's indistinguishable from an oasis. You can only tell that it's a mirage by investigating how the appearance was created (e.g. by dragging your thirsty ass through the sand, to the place where the oasis appeared to be).
The illusion happens when, clearly, the alleged reasoning behind how such a system comes to be is based on prior knowledge of the system as a whole. Meaning, its construction/source was within the training data.
My opinion is it isn't binary, rather it's a scale. Your example is a point on the scale higher than what it is now.
But perhaps that's too liberal a definition of "reasoning", no idea.
We seem to move the goalposts on what constitutes human level intelligence as we discover the various capabilities exhibited in the animal kingdom. I wonder if it is/will be the same with AI
His output was indeed word-salad, but he was eloquent. His bullshit wasn't fallacious reasoning; it didn't even have the appearance of reasoning at all. He was just stringing together words and concepts that sound plausible. It was funny, because his audience knew (and were supposed to know) that it was nonsense.
LLMs are the same, except they're supposed to pretend that it isn't nonsense.
Yeah, I think you've got a good example that improves the analogy.
They learn their first words, how to walk, what a cat looks like from many perspectives, how to parse a visual scene, how to parse the spoken word, interpret facial expressions and body language, how different objects move, how different creatures behave, different materials feel, what things cause pain, what things taste like and how they make them feel, how to get what they want, how to climb, how not to fall, all by trial & example. On and on.
And yes, as we get older we get better and better at learning 2nd hand from others verbally, and when people have the time to show us something, or with tools other people already invented.
Like how a post-trained model picks up on something when we explain it via a prompt.
But that is not the kind of training being done by models at this stage. And yet they are learning concepts (pre-prompt) that, as you point out, you & I had to have explained to us.
Models don't learn by you telling them something; the model doesn't update itself. A human updates their model when you explain how something works to them, and that is the main way we teach humans. Models don't update themselves when we explain how something works to them. That isn't how we train these models, so the model isn't learning, it's just evaluating. It would be great if we could train models that way, but we can't.
> Humans learn vast amounts of information from examples.
Yes, but to understand things in school those examples come with an explanation of what happens. That explanation is critical.
For example, a human can learn to perform legal chess moves in minutes. You tell them the rules each piece has to follow and then they will make legal moves in almost every case. You don't do it by showing them millions of chess boards and moves; all you have to do is explain the rules, and the human then knows how to play chess. We can't teach AI models that way, and this makes human learning and machine learning fundamentally different still.
And you can see how teaching rules creates a more robust understanding than just showing millions of examples.
I am curious who taught you to recognize sounds, before you understood language, or how to interpret visual phenomena, before you were capable of following someone’s directions.
Or recognize words independent of accent, speed, pitch, or cadence. Or even what a word was.
Humans start out learning to interpret vast amounts of sensory information, and to predict the results of their physical motor movements, from a constant stream of examples.
Over time they learn the ability to absorb information indirectly from others too.
This is no different from models, except that it turns out, they can learn more things, at a higher degree of abstraction, just from example than us.
And work on their indirect learning (i.e. long-term retention of information we give them via prompts) is just beginning.
But even as adults, our primary learning mode is experience: the example situations we encounter non-stop as we navigate life.
Even when people explain things, we generalize a great deal of nuance and related implications beyond what is said.
“Show, don’t tell”, isn’t common advice for no reason. We were born example generalizers.
Then we learn to incorporate indirect information.
I do not know that much about AI but I know at least something about cognitive psychology and it seems to me that a lot of claims about LLMs "not actually reasoning" and similar are probably made by CS graduates who have unreflected assumptions about how human thinking works.
I don't claim to know how human thinking works but if there is one thing I would conclude from studying psychology and knowing at least some basics about neuroscience, it would be that "it's not how it appears to us".
Nobody knows how human reasoning actually works but if I had to guess (based on my amateurish mental model of the functioning of the human brain), I would say that it is probably a lot closer to LLMs and a lot less rational than is commonly assumed in discussions like this one.
If you want to follow this more closely, I'd recommend the work of Evelina Fedorenko, a cognitive neuroscientist at MIT who specializes in language understanding.
Check out these talks for more details: https://youtu.be/TsoQFZxrv-I?t=580 https://youtu.be/qublpBRtN_w
What this means in the context of LLMs is that next word prediction alone does not provide the breadth of cognitive capacity humans exhibit. Again, I'd posit GPT-2 is plenty capable as an LM, if combined with an approach to perform higher-level reasoning to guide language generation. Unfortunately, what that system is and how to design it currently eludes us.
Maybe I diverted your focus the wrong way when I used LLMs as an example - what if I used more general term "neural network"? I said LLMs because this thread is about LLMs but let me clarify what I meant:
The thing that interests me in this thread is the claim that LLMs are "not capable of actually reasoning". Whether you agree with it depends on your mental model of actual reasoning, right?
My model of reasoning: the fundamental thing about it is that I have a network of things. The signal travels through the network guided by the weight of connections between them and fires some pattern of the things. That pattern represents something. Maybe it is a word in the case of LLMs (or syllable or whatever the token actually is - let's ignore those details for now) or a thought in the case of my brain (I was not saying people reason in language) - the resulting "token" can be many things, I imagine (like some mental representation of objects and their positions in spatial reasoning) - those are the specifics, but "essentially", the underlying mechanism is the same.
In my mental model, there is nothing fundamental that distinguishes what LLMs do from "actual reasoning". If you have enough compute and good enough training data, you can create an LLM that reasons as well as humans - that is my default hypothesis.
If I understand your position, you would not agree with that, correct? I am not claiming you are wrong - I know way too little for that. I would just be really curious - what is your mental model of actual reasoning? What does it have that LLMs do not have?
I know you mentioned that "these models have no capacity to plan ahead" - I am not sure I understand what you mean by that. Is this not just a matter of training?
BTW, I have talked about this topic before and some people apparently see consciousness as a necessary part of actual reasoning. I do not - do you?
For example, you shouldn't expect it to be able to make valid chess moves reliably, that requires reading and understanding rules which it can't do during training. It can get some understanding during evaluation, but we really want to be able to encode that understanding into the model itself rather than have to keep it in eval time.
There is a distinction between reasoning skills learned inductively (generalizing from examples), and reasoning learned deductively (via compact symbols or other structures).
The former is better at recognition of complex patterns, but can incorporate some basic deduction steps.
But explicit deduction, once it has been learned, is a far more efficient method of reasoning, and opens up our minds to vast quantities of indirect information we would never have the time or resources to experience directly.
Given how well models can do at the former, it’s going to be extremely interesting to see how quickly they exceed us at the latter - as algorithms for longer chains of processing, internal “whiteboarding” as a working memory tool for consistent reasoning over many steps and many facts, and long term retention of prompt dialogs, get developed!
"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."
... because on GPT 3.5 the answer begins like the below and then gets worse:
"Let's break down the process step by step:
Initially, you have 50 red balls and 50 blue balls in the container.
1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."
If I was hiring interns this dumb, I'd be in trouble.
EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.
So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."
I asked for a detailed probabilistic analysis and it gives a very simplified analysis. And then basically says that a Monte Carlo approach would be easier. That actually sounds more like most people I know than most people I know. :-)
Seems like quite a difficult question to compute exactly.
I reworded the question to make it clearer and then it was able to simulate a bunch of scenarios as a monte carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.
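For what it's worth, the exact expected counts are not that hard to get with dynamic programming, if you read the puzzle as "each draw is uniformly random over whatever is currently in the container" (that reading is my assumption; the original wording doesn't say). A rough sketch in Python: track the probability of every (red left, blue left, white left) state and push it forward one draw at a time. The function and parameter names are made up for this example.

```python
from collections import defaultdict

def expected_remaining(red=50, blue=50, draws=100, whites_per_blue=2):
    """Exact expected (red, blue, white) left, assuming uniform random draws."""
    # dist maps a container state (red, blue, white) to its probability.
    dist = {(red, blue, 0): 1.0}
    for _ in range(draws):
        nxt = defaultdict(float)
        for (r, b, w), p in dist.items():
            total = r + b + w
            if total == 0:
                # Nothing left to draw; cannot happen with the default numbers.
                nxt[(r, b, w)] += p
                continue
            if r:  # draw a red ball
                nxt[(r - 1, b, w)] += p * r / total
            if b:  # draw a blue ball, then add the white balls back
                nxt[(r, b - 1, w + whites_per_blue)] += p * b / total
            if w:  # draw a white ball
                nxt[(r, b, w - 1)] += p * w / total
        dist = nxt
    exp_red = sum(p * r for (r, b, w), p in dist.items())
    exp_blue = sum(p * b for (r, b, w), p in dist.items())
    exp_white = sum(p * w for (r, b, w), p in dist.items())
    return exp_red, exp_blue, exp_white

exp_red, exp_blue, exp_white = expected_remaining()
print(f"red={exp_red:.3f} blue={exp_blue:.3f} white={exp_white:.3f}")
```

A handy sanity check: every draw removes one ball and every blue draw adds two, so after 100 draws the container always holds exactly twice the number of blue balls drawn. There are at most 51 x 51 reachable states per step, so this runs quickly.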
These are very good questions that anyone with the ability to reason would ask if given this problem.
You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of the room with no tools and getting grilled in an oral examination of their knowledge.
Instead, collaborate with it, while giving it the appropriate tools to help you.
I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.
I even got it to save a video of the plot rotating: https://streamable.com/2aphbz
All told, I'd say it's a decent answer.
Edit: I took it to completion: https://chat.openai.com/c/6cdd92f1-487a-4e1c-ab94-f2bdbf282d...
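For anyone who wants to reproduce the idea without Mathematica, a Monte Carlo version fits in a few lines of Python. To be clear, this is not the script from the chat above, just a sketch under the assumption that each draw is uniformly random over the balls currently in the container. The function names, the run count and the seed are arbitrary choices of mine.

```python
import random

def simulate_once(rng, red=50, blue=50, draws=100, whites_per_blue=2):
    """Play the puzzle once and return the balls left in the container."""
    counts = {"red": red, "blue": blue, "white": 0}
    for _ in range(draws):
        if sum(counts.values()) == 0:
            break  # container empty; cannot happen with the default numbers
        # Pick a colour with probability proportional to how many are left.
        colour = rng.choices(list(counts), weights=list(counts.values()))[0]
        counts[colour] -= 1
        if colour == "blue":
            counts["white"] += whites_per_blue
    return counts

def monte_carlo(runs=2000, seed=0):
    """Average the leftover counts over many independent runs."""
    rng = random.Random(seed)
    totals = {"red": 0, "blue": 0, "white": 0}
    for _ in range(runs):
        for colour, n in simulate_once(rng).items():
            totals[colour] += n
    return {colour: n / runs for colour, n in totals.items()}

averages = monte_carlo()
print(averages)
```

The averages only estimate the expected counts. Individual runs vary quite a bit, which is why a histogram of outcomes is the more informative picture.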
These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.
e composition of what's left in the container.
There's a couple of scenarios, which depend on when you run out of blue balls:
1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.
2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.
2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.
Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.
> 2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.
Seems to me like it reasons its way to this answer at the end: Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?
It took a bit of massaging, and I enabled the Data Analysis plugin, which lets it write Python code and run it. The simulation code looks correct, though.
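For anyone who wants to check the simulation independently, here is a minimal sketch in Python. It assumes the setup as described in this thread (50 red and 50 blue balls to start, two white balls added for every blue ball drawn, 100 draws) and additionally assumes that every ball currently in the container is equally likely to be drawn. The function names are mine, not from the ChatGPT session.

```python
import random
from collections import Counter

def simulate_once(rng, red=50, blue=50, draws=100, whites_per_blue=2):
    """Run one game and return the (red, blue, white) counts left in the container."""
    counts = {"red": red, "blue": blue, "white": 0}
    for _ in range(draws):
        # Assumption: pick uniformly among all balls currently in the container.
        pick = rng.randrange(sum(counts.values()))
        for color, n in counts.items():
            if pick < n:
                break
            pick -= n
        counts[color] -= 1
        if color == "blue":
            counts["white"] += whites_per_blue
    return counts["red"], counts["blue"], counts["white"]

def simulate_many(runs=10_000, seed=0):
    """Tally how often each final composition occurs over many runs."""
    rng = random.Random(seed)
    return Counter(simulate_once(rng) for _ in range(runs))

if __name__ == "__main__":
    tally = simulate_many()
    runs = sum(tally.values())
    for axis, name in enumerate(("red", "blue", "white")):
        mean = sum(outcome[axis] * n for outcome, n in tally.items()) / runs
        print(f"average {name} left: {mean:.2f}")
```

One sanity check that needs no simulation: 100 balls go out and two come in per blue ball drawn, so the container always ends with exactly twice as many balls as blue balls drawn.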
Uhm.
You see reasoning issues when you use more real world examples, rather than theoretical tests.
I had 4 failure states.
1) Summarization: It summarized 3 transcripts correctly, for the fourth it described the speaker as a successful VC. The speaker was a professor.
2) It was to act as a classifier, with a short list of labels. Depending on the length of text, the classifier would swap over to text gen. Other issues included novel labels, new variations of labels, and so on.
3) Agents - This died on the vine. Leave aside having to learn async, vector DBs, or whatever. You can never trust the output of an LLM, so you can never chain agents.
4) I focused on using ChatGPT to complete a project. I hadn't touched HTML ever - the goal was to use ChatGPT to build the site. This would cover design, content, structure, development, hosting, and improvements.
I still have trauma. Wrong code and bad design were the basic issues. If the code was correct, it simply meant I had dug a deeper grave. I had anticipated 70% of the work being handled by ChatGPT; it ended up at 30% at the most.
ChatGPT is great IF you already are a subject expert - you can brush over the issues and move on.
"Hallucinations" is the little bit of string that you pull on, and the rest unravels. There are no hallucinations, only humans can hallucinate - because we have an actual ground truth to work with.
LLMs are only creating the next token. For them to reason, they must be holding structures and proxies in some data store, and actively altering it.
It's easier to see once you deal with hallucinations.
Example of basic problem: In a shop, there are 4 dolls of different heights P,Q,R and S. S is neither as tall as P nor as short as R. Q is shorter than S but taller than R. If Kittu wants to purchase the tallest doll, which one should she purchase? Think step by step.
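The expected answer can be checked mechanically. A minimal sketch in Python, reading "S is neither as tall as P nor as short as R" as R < S < P and relying on the puzzle's statement that all four heights differ. The helper name is made up for illustration.

```python
from itertools import permutations

def tallest_candidates():
    """Return the set of dolls that can be tallest under the puzzle's constraints."""
    tallest = set()
    # Try every strict ordering of the four dolls; rank 0 is shortest, 3 is tallest.
    for order in permutations("PQRS"):
        h = {doll: rank for rank, doll in enumerate(order)}
        s_between_r_and_p = h["R"] < h["S"] < h["P"]  # S shorter than P, taller than R
        q_between_r_and_s = h["R"] < h["Q"] < h["S"]  # Q shorter than S, taller than R
        if s_between_r_and_p and q_between_r_and_s:
            tallest.add(order[-1])
    return tallest

print(tallest_candidates())  # {'P'}
```

Only one ordering survives both constraints (R < Q < S < P), so P is the doll to buy.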
You are really trying to make it not have reasoning for your own benefit
This whole thread really seems like it's the other way around. It's still very easy to make ChatGPT spit out obviously wrong answers depending on the prompt. If it had an actual ability to reason, as opposed to just generating a continuation of your prompt, the quality of the prompt wouldn't matter as much.
If that's the case, then most humans alive would fail to meet this threshold. Finding a general solution to a specific problem, and identifying whether or not there exists a closed-form solution, and even knowing these terms, are skills you're taught in higher education, and even the people who went through it are prone to forget all this unless they're applying those skills regularly in their life, which is a function of specific occupations.
GPT 4 goes into detail about one example scenario, which most humans won't do, but it is a technically correct answer, as it said it depends on the order.
- *Ending Scenario:* - Red Balls (RB): 0 (all have been drawn) - Blue Balls (BB): 50 (none have been drawn) - White Balls (WB): 0 (since no blue balls were drawn, no white balls were added) - Total Balls: 50
It should give you pause that you had to pick not only the line by which to judge the answer but the part of the line. The sentence immediately before that is objectively wrong:
> This is one possible scenario.
It says the number of blue balls drawn is x and the number of red balls drawn is y, and then asserts x + y = 100, which is wrong.
Then it proceeds to "solve" an equation which reduces to x = x to conclude x = 0.
It then uses that to "prove" that y = 100, which is a problem as there are only 50 red balls in the container and nothing causes any more to be added.
It's like "mistakes bad students make in Algebra 1".
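To make the objection concrete: with x blue, y red, and w white balls drawn, the constraint is x + y + w = 100, not x + y = 100. A rough Python sketch that enumerates the reachable draw counts, assuming the setup described in this thread (50 red, 50 blue, two white balls added per blue ball drawn, 100 draws). The names x and y follow the quoted answer; w and the function name are my additions.

```python
def feasible_draw_counts(red=50, blue=50, draws=100, whites_per_blue=2):
    """List every (x, y, w) = (blue, red, white) drawn that some draw order can produce."""
    outcomes = []
    for x in range(blue + 1):
        for y in range(red + 1):
            w = draws - x - y  # whatever is not blue or red must be white
            # White balls only exist after blue draws, so at most whites_per_blue * x
            # of them can ever be drawn (drawing all x blues first makes any such w reachable).
            if 0 <= w <= whites_per_blue * x:
                outcomes.append((x, y, w))
    return outcomes

outcomes = feasible_draw_counts()
print((0, 100, 0) in outcomes)             # the claimed x = 0, y = 100: False
print([o for o in outcomes if o[2] == 0])  # where x + y = 100 does hold: [(50, 50, 0)]
print(min(x for x, _, _ in outcomes))      # fewest blue balls that can be drawn: 17
```

So x + y = 100 holds in exactly one case (all 50 blue and all 50 red drawn, no white), and x = 0 is not reachable at all, since drawing 100 balls requires at least 17 blue draws to put enough white balls in the container.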
I've had to work with imperfect machines a lot in my recent past. Just because sometimes it breaks, doesn't mean it's useless. But you do have to keep your eyes on the ball!
I think that's the crux of the whole argument. It's an imperfect (but useful) tool, which sometimes produces answers that make it seem like it can reason, but it clearly can't reason on its own in any meaningful way.
This problem doesn't result in a constant value, it results in a 3D probability distribution! Very, very few humans could work that out without tools. (I'm including pencil and paper in "tools" here.)
With only a tiny bit of coaxing, GPT 4 produced an animated video of the solution!
Try to guess what fraction of the general population could do that at all. Also try to estimate what fraction of general software developers could solve it in under an hour.
GPT-4 with help from a competent human will of course do better than most humans, but that isn't what we are discussing.
I disagree. Don't assume "most humans" are anything like Silicon Valley startup developers. Most developers out there in the wild would definitely struggle to solve problems like this.
For example, a common criticism of AI-generated code is the risk of introducing vulnerabilities.
I just sat in a meeting for an hour, literally begging several developers to stop writing code vulnerable to SQL injection! They just couldn't understand what I was even talking about. They kept trying to use various ineffective hacky workarounds ("silver bullets") because they just didn't grok the problem.
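For readers who haven't run into it, the usual fix is to pass user input as a bound parameter instead of splicing it into the SQL string. A minimal illustration using Python's built-in sqlite3 module; the table, data, and function names are invented for the example and have nothing to do with the code from that meeting.

```python
import sqlite3

def find_user_unsafe(conn, name):
    # Vulnerable: the input becomes part of the SQL text itself.
    return conn.execute(f"SELECT id, name FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn, name):
    # Parameterized: the input is bound as a value and is never parsed as SQL.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()

# Hypothetical in-memory table, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "nobody' OR '1'='1"
print(find_user_unsafe(conn, payload))  # returns every row in the table
print(find_user_safe(conn, payload))    # returns no rows
```

Escaping or filtering the input string by hand is the kind of workaround that tends to miss cases; binding parameters removes the problem because the input is never interpreted as SQL.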
I've found GPT 4 outperforms median humans.
On what basis do you allege this? People say the most unhinged stuff here about AI, and it so often goes completely unchallenged. This is a huge assertion that you are making.
You’re asked a question and you just have to spit out the answer. No option to backtrack, experiment, or self correct.
“Translate this to Hebrew”.
“Is this a valid criticism of this passage from a Platonic perspective?”
“Explain counterfactual determinism in Quantum Mechanics.”
“What is the cube root of 74732?”
You would fail all of these. The AI gets 3 of 4 correct.
Tell me who’s smarter?
You because of your preconceptions, or because of real superiority?
Your model for human intelligence is probably more like this scene: https://youtu.be/KvMxLpce3Xw?si=Suy0Cj_pL0vru5Uj
The reality is the opposite. The AI could answer questions in this scenario but no nonfictional human could.
>“Is this a valid criticism of this passage from a Platonic perspective?”
I haven't seen AI answer questions like this correctly at all.