Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4 (arxiv.org)

44 points by saurabh20n 3 years ago | 60 comments

RayVR 3 years ago

This aligns well with my personal experience using gpt-4.

The model provides surprisingly good responses on topics which I know are readily available online while being potentially troublesome to find the exact information I want. I have even found it useful when I know there is a tool for what I want but can’t recall the jargon used to find it via Google. Simply describing the rough idea is enough to get the model to spit out the jargon I need.

However, the moment I ask a real question that goes beyond summarizing something which is covered thousands of times online, I am immediately let down.

Is this just a result of the foundation of the model being the world's best autocompletion engine? My assessment is “yes”, and I don’t believe that any of the coming modifications, like plugins, will fundamentally change this.

raydiatian 3 years ago

I have been thinking for a few weeks now that we need another term for large language models trained on colossal datasets: AGK, artificially generally/globally knowledgeable. It can mimic a likeness of problem solving because the corpus it was trained on is full of problem/solution pairs in the abstract. But task it with any novel problem solving challenge outside of its training that is of sufficient complexity and it will balk, thereby precluding it from being AGI, because humans are by nature problem solvers.

Furthermore, I just don’t feel like the transformer architecture is suited for problem solving. Like I may just be a charlatan but self attention over the space of words does not seem like it’s going to be enough, and praying it falls out in emergent behavior if we can just add more parameters is… unscientific-ish? Now, if you could figure out a way to do self-attention over the space of concepts? Maybe you’ve got something.

I feel like AlphaGo ideas and some variation on MCTS is more likely to produce a solid problem solving architecture.

famouswaffles 3 years ago

Reading the paper, it seems these are problems a lot of people would fail at too, at least some of the time. That LLMs are not superhuman at logical reasoning seems to be the conclusion more than anything.

outofpaper 3 years ago

Often it can actually solve more complex problems but needs to have its "hand held". Essentially the model needs to be guided to and through problem-solving techniques. We have to remember that LLMs are literally inference engines. They default to providing us with probable results, probable responses. They can be pulled away from these "knee jerk" responses.

rvz 3 years ago

> However, the moment I ask a real question that goes beyond summarizing something which is covered thousands of times online, I am immediately let down.

I'm very sure I said this from the start, against the ridiculous hype. Summarization of existing text is the *only* safe use case for LLMs. Anything else is asking for disappointment.

We have already seen it used as a search engine and it confidently hallucinates incorrect information. We have seen it pretend to be a medical professional or a replacement attorney or lawyer and it has outright regurgitated nonsensical and dangerous advice - making itself completely unreliable for that use-case especially since (deep) neural networks in general are still the same black-boxes, unable to explain and reason about their own decisions; making them unsuitable for high risk applications and use-cases.

As for writing code, despite what the hype-squad tells you about both GPT-4 and ChatGPT, the ground reality is that they generate broken code from the start and cannot reason about why they did that in the first place. Non-programmers wouldn't question the output, whereas an experienced professional would catch the errors immediately.

Because of this untrustworthiness, programmers now have to check and review the output generated by GPT-4 and ChatGPT every time they use it in their projects, which is more review work than before.

The AI LLM hype has only further exposed its limitations.

pottspotts 3 years ago

For a significant number of software developers, GPT and Github's Copilot have replaced StackOverflow, and even Googling more generally. It is more than an autocomplete, it is the best resource for software development by far, IMO. It's a tutor that's an expert in virtually every topic.

mftb 3 years ago

Yeah, it's not. Sorry to contradict, but it's not like that. In any kind of tutoring arrangement your time with them is limited, and if they're any good, they don't just regurgitate limitless example code. Two of the most important decisions an instructor has to make are how much access to give you and how much example material to give you, because the actual learning begins when you have to think for yourself, when you are forced to confront a black screen with a flashing cursor and fill it with your own ideas. So interacting with ChatGPT may be a great experience, but it's not that. Maybe someday it will be.

inopinatus 3 years ago

It really isn't. GPT-4 is certainly an improvement over previous language models, but when I vaingloriously gave it the questions from my favourite self-answers on StackOverflow, only one completion was immediately correct. The remainder were variously suboptimal, poorly crafted, overdesigned, incomplete, or downright wrong, requiring multiple re-prompts to coax into usable condition. They were all syntactically valid but tended to misconstrue the semantics and underestimate the capabilities of the programming environments concerned. Try it with your own, but to me it's more like coaching a bright but inexperienced junior developer with the "confidently incorrect" trait.

skepticATX 3 years ago

I have to completely disagree with this.

Where GPT-4 shines for me is when I have a project swimming around in my head that I want to work on for fun. It can get you off of the ground quickly, and for side projects the quality and correctness of the output isn't that important.

For professional software development, GPT-4 is still wrong way too often for me to feel comfortable using it. And it's not all that much faster than going straight to the source anyways.

arghnoname 3 years ago

When people just ask ChatGPT for solutions and there's no community, à la Stack Overflow, where will it get the answers to future problems?

If ChatGPT is so successful that people stop producing content, it might end up in a local optimum that isn't so optimal.

mattdeboard 3 years ago

No, even expert tutors know how to say “I don’t know” in the face of uncertainty, instead of remorselessly spitting up nonsense as language models do.

boringuser2 3 years ago

I don't agree.

I still use stack overflow regularly as an engineer.

Sometimes GPT-4 will have a quicker tailor-fit answer, but sometimes it will flounder as well.

anothernewdude 3 years ago

Expert as of 2021, which is obsolete for many software dev purposes, not that SO is much better.

dbrueck 3 years ago

Similarly, when I think of ChatGPT as a really cool and advanced search engine frontend, its behavior, including its limitations and its failures, makes the most sense to me.

seba_dos1 3 years ago

It's a language model, not a search engine. It doesn't work well as one unless it is integrated into an actual search engine, as Bing does. Without such integration, it's much closer to human memory than to a search engine: it will recall stuff it has seen many times pretty well and completely fail at stuff it glanced over just once, filling any gaps with made-up stuff like a kid in an exam hoping to get at least a few points with wild guesses.

HWR_14 3 years ago

> a really cool and advanced search engine frontend

This is the saddest version of ChatGPT I can imagine. I found that as search engines emulated natural language, their results got steadily worse.

I just want the Google results and interface from a long time ago.

seba_dos1 3 years ago

> I am immediately let down

Why? I'm not sure how you could expect anything else in the first place.

Closi 3 years ago

I think one main failure in the framing of these papers (and discussion of LLMs more broadly) is that the abstract says that GPT4 ‘struggles’ with logical reasoning:

> ChatGPT and GPT-4 do relatively well on well-known datasets […] however, the performance drops significantly when handling newly released and out-of-distribution [where] Logical reasoning remains challenging for ChatGPT and GPT-4

But reading the paper, the challenges it fails on are ones that I wager the average human would fail on too (at least a good portion of the time).

The paper might strictly be accurate, but I think we should try and bring these papers back to a real-world context - which is that it’s probably operating above your average human at these tasks.

Is superhuman/genius-level capability really required before we say the LLMs are any good?

(I see this view on HN too - statements like ‘LLMs can’t create novel maths theorems!’ as an argument that LLMs aren’t good at reasoning, disregarding that most humans today can’t find novel/undiscovered maths theorems)

jillesvangurp 3 years ago

People aren't that good at logic either. So, gpt-4 not being great at this is maybe not that surprising.

Probably the best feature of gpt-4 is the ability to use tools. For example, it may not be that good at calculating things. But it can use a calculator. And if you think about it, a lot of people (including mathematicians) aren't actually that good at calculating either. We all learn it in school and then we forget much of it. That's why we have calculators. It's not a big deal.

Gpt-4 is more than capable of knowing the best tool for the job. Figuring out how to use it isn't that hard. You can actually ask it "what's the best tool for X", get a usable response, and then ask a follow up question to produce a script in the language of your choosing that demonstrates how to use it, complete with unit tests. Not a hypothetical, I've been doing that in the past few weeks and I've been getting some usable results.

And that's put me in a mind of wondering what will happen once we start networking all these specialized AIs and tools together. It might not be able to do everything by itself but it can get quite far figuring out requirements and turning those into running code. It's not that big of a leap from answering questions about how to do things to actually building programs that do things.
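The calculator point above can be made concrete with a rough sketch. Everything here is an illustrative assumption rather than how GPT-4 or any plugin system actually works: the `calculate` function, the `TOOLS` table, and the request format `{"tool": ..., "input": ...}` are hypothetical names invented for this example. The idea is only that the model emits a structured request and ordinary code does the arithmetic.

```python
import ast
import operator

# Arithmetic operators the hypothetical calculator tool is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""

    def _eval(node):
        # Numeric literals only; booleans and strings are rejected.
        if isinstance(node, ast.Constant):
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("only numeric literals are supported")
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are all refused.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


# Hypothetical tool registry: tool name -> callable taking a string.
TOOLS = {"calculator": calculate}


def run_tool(request: dict):
    """Dispatch a model-produced request (assumed format) to the named tool."""
    tool = TOOLS.get(request["tool"])
    if tool is None:
        raise KeyError(f"unknown tool: {request['tool']}")
    return tool(request["input"])
```

In this sketch the model's only job is to produce something like `{"tool": "calculator", "input": "(1 + 2) ** 3"}`; the arithmetic itself never depends on the model getting the digits right.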

causality0 3 years ago

They're good at "memory" reasoning but terrible at deductive reasoning. If you say there's a sign in front of a door saying "push", it will tell you that you need to push the door. But if you say there was a powerful wind and you see a sign saying "pull" lying on the ground on the other side of a glass door, it has no idea whether you should push or pull.

cjbprime 3 years ago

I guess I'm with the LLM on this one, since I can't follow your example. Did the sign flip over while it was falling? Did the sign fall towards or away from the glass door that I am on the other side of? Where are the doorhandles?

Can you write this example in a way that's more comprehensible to humans, and then we can ask GPT-4 about it?

causality0 3 years ago

The sign only has one side. It's a sign saying "pull" that was knocked off the other side of a glass door by the wind.

progrus 3 years ago

Will we ever get apologies from the AI-Foom crew for losing their marbles and riling people up about the word calculator?

micromacrofoot 3 years ago

word calculator is a more impressive title than I’d grant some people

dmz73 3 years ago

LLMs are just programs that can produce human-like language output based on human-like language input, and calling them AI of any kind greatly overstates their capabilities. There is no "reasoning" or "understanding" here; there is just a giant ball of mud full of auto-generated if-then-else-like code with calls to a random number function peppered around.

The two main problems I see with attributing AI to these programs are: 1. People will assume they are receiving an intelligent response they can rely on without sanity checking. This is different from receiving the same response from other people, because one learns whom to trust and when. You can never trust these programs. 2. If/when real AI emerges, it will be treated poorly because most people will assume it is the same "brainless" AI they were sold so many times before. In that respect the treatment of real AI will be equivalent to child abuse or slavery and will result in another giant black mark in human history.