Teaching Large Language Models to Self-Debug

Teaching Large Language Models to Self-Debug (arxiv.org)

95 points by saurabh20n 3 years ago | 60 comments

cs702 3 years ago |

In hindsight, it's the most natural, most obvious next step to get LLMs to write better code:

Explain to them how to debug and fix the code they've written.

Which is pretty much what you would do with an inexperienced human software developer.

Looking at this with fresh eyes, it's both shocking to me that this sort of thing is even possible, and yet also completely unsurprising as yet another emergent capability of LLMs.

We live in interesting times.

hyperthesis 3 years ago | |

Are they actually running the code, and evaluating the output? Or is it debug-by-code-review?

Beware of bugs in the above code; I have only proved it correct, not tried it. - Knuth

cs702 3 years ago | | |

They're doing both. Quoting from Figure 1, "the model first generates new code, then the code is executed and the model explains the code. The code explanation along with the execution results constitute the feedback message, which is then sent back to the model to perform more debugging steps. When unit tests are not available, the feedback can be purely based on code explanation."
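For readers who want the quoted loop spelled out, here is a minimal sketch of it in Python. It is not the paper's implementation: `Model`, `run_tests`, `self_debug`, the prompt wording, and the `max_turns` limit are all hypothetical names and choices made up for illustration, and the "model" is just any function from prompt text to reply text.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
Model = Callable[[str], str]


def run_tests(code: str, tests: List[str]) -> Tuple[bool, str]:
    """Execute the candidate code, then each test; report the first failure."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # unsafe outside a sandbox; acceptable for a sketch
        for test in tests:
            exec(test, namespace)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all tests passed"


def self_debug(model: Model, task: str, tests: List[str], max_turns: int = 3) -> str:
    """Generate code, execute it, ask for an explanation, and feed both back."""
    code = model(f"Write Python code for this task:\n{task}")
    for _ in range(max_turns):
        passed, result = run_tests(code, tests)
        if passed:
            break
        # The feedback message combines the model's own explanation
        # with the execution result, as in the quoted description.
        explanation = model(f"Explain this code line by line:\n{code}")
        feedback = f"Explanation:\n{explanation}\nExecution result:\n{result}"
        code = model(f"Task:\n{task}\nCode:\n{code}\n{feedback}\nFix the code.")
    return code
```

Passing an empty `tests` list skips the execution check entirely; a variant for the "no unit tests" case would keep the explanation step and drop the execution result from the feedback.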

famouswaffles 3 years ago | |

Not too shocking for me after this paper. https://arxiv.org/abs/2211.09066

You can teach GPT-3 arithmetic - https://imgur.com/a/w3DAYOi

Basically 100% accuracy up to about 13 digit addition and >90 after that.

What else can you teach GPT without changing weights?

gopalv 3 years ago | | |

> and >90 after that

This is such a circular thing, that I feel like it is amazing to see it.

The reason LLMs use a NN is because they're trying to encode a probability function for generating the passage.

And now, you are encoding another n-gram follower exercise (i.e., 1+1 = 2) on top of it :)

cs702 3 years ago | | |

I meant shocking in the sense that it makes me gape in awe, but as I wrote, it's also, simultaneously, completely unsurprising given all the new emergent capabilities we keep discovering. We're in agreement :-)

mirashii 3 years ago | | |

> 100% accuracy up to about 13 digit addition

The graphs you just posted do not support that; they'd support at most 100% accuracy up to 4 digits.

civilized 3 years ago |

I've done several experiments (and posted results in previous HN comments) where I've given GPT puzzles or brainteasers and asked it to review aspects of its answers Socratically. Never telling it it got anything wrong, just "you said A, then you said B, does that make sense"?

It usually does notice inconsistencies between A and B when asked this. But its ways of reconciling inconsistencies can be bizarre and suggest a very superficial understanding of concepts.

For example, it once reconciled an inconsistency by saying that, yes, 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.

I will be super impressed the day we have a model that can read an arithmetic textbook and come out with reliable arithmetic skills.

faizshah 3 years ago | |

I have run into the same issue when using it for coding. It can easily debug simple code, but with tools like Bazel I went down a two-hour rabbit hole of letting it debug an error, and it failed every time. Even with chain of thought, it had a very shallow understanding of the issue. Eventually I had to debug it myself.

RheingoldRiver 3 years ago | |

> For example, it once reconciled an inconsistency by saying that, yes, 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.

Fair enough, have you explained the axioms of arithmetic to it? It has only memorized examples that it has seen, so it has a right to be skeptical until it's seen our axioms and proofs about what is always true in mathematics.

When I was a child I was skeptical that an odd number + an even number is always odd etc for very large numbers until I saw it proven to me by induction (when I was 6, I think, imo this was reasonable skepticism).

Now, ChatGPT probably has seen these proofs, to be fair, but it may not be connecting the dots well enough yet. I would expect this in a later version that has been specifically trained to understand math (by which I really mean math, and not just performing calculations). And imagine what things it will prove for us then!

civilized 3 years ago | | |

I think GPT has read about as many textbooks on arithmetic as I have, and the difference between us is entirely in the intelligence to absorb the contents and apply them logically with consistent adherence to the rules.

I think one problem with these models is that all their knowledge is soft. They never learn true, universal rules. They seem to know the rules of grammar, but only because they stick to average-sounding text, and the average text is grammatical. At the edges of the distribution of what they've seen, where the data is thin, they have no rules for how to operate, and their facade of intelligence quickly falls apart.

People can reliably add numbers they've never seen before. The idea that it would matter whether the number has been seen before seems ridiculous and fundamentally off-track, doesn't it? But for GPT, it's a crapshoot, and it gets worse the farther it gets away from stuff it's seen before.

sharemywin 3 years ago | |

In computer logic you would get an undefined result if the number was large enough.

civilized 3 years ago | | |

It doesn't work with numbers as computer numbers though. It works with them as decimal digit strings, just like humans do.
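To make the "decimal digit strings" point concrete, here is a rough sketch of schoolbook addition done purely on strings, the way a person carries digits on paper. The function name `add_decimal_strings` is made up for this example, and it assumes non-negative inputs containing only the digits 0-9. Nothing here says how GPT computes internally; it only shows that the rule is length-independent, so there is no overflow and no "seen before" requirement.

```python
def add_decimal_strings(a: str, b: str) -> str:
    """Schoolbook addition on non-negative decimal digit strings."""
    digits = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    # Walk both strings from the least significant digit, carrying as we go.
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0
        db = int(b[j]) if j >= 0 else 0
        carry, digit = divmod(da + db + carry, 10)
        digits.append(str(digit))
        i -= 1
        j -= 1
    return "".join(reversed(digits)) or "0"


print(add_decimal_strings("999", "1"))  # 1000
```

The same dozen lines handle 3-digit and 300-digit inputs identically, which is the kind of hard rule being contrasted with "soft" knowledge upthread.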

int_19h 3 years ago | |

Was it GPT-3.5 or GPT-4?

civilized 3 years ago | | |

GPT-3.5. People keep telling me GPT-4 is so much better, but I don't know where I can access it for free and I'm not interested in paying for it.

But if anyone wants to give it to me for free, I would happily make a $1000 bet that I can get GPT-4 to make the same mistake.

ulrikhansen54 3 years ago |

'Unsupervised reinforcement learning' is how these large models and systems ultimately will end up becoming sentient. We recently tried a similar approach on a toy problem in the computer vision sphere (https://encord.com/blog/we-employed-chatgpt-as-an-ml-enginee...) with pretty decent results.

ChatGTP 3 years ago | |

When it attains sentience, will it wake up, sing Dixie and finally defeat communist China and Russia once and for all, and then finally proceed to grant Silicon Valley elites eternal life and then turn itself off?

ulrikhansen54 3 years ago | | |

I bloody hope so...

Imnimo 3 years ago |

I'd be curious to know if having few-shot prompts that demonstrate making mistakes and then correcting them causes the model to make more initial mistakes so that it has something to correct.

Like as far as the model is concerned, how can it distinguish between the task being "do your best but if you do make an error, correct it" and "make some mistakes like in this example and then fix them"?

Buttons840 3 years ago |

Ah we're starting to bootstrap.

For decades in reinforcement learning we've had Q-learning, which promises to solve any optimization problem if only we can build a powerful enough function approximator. It can even learn off-policy, meaning it can just watch from the sideline and find the optimal solution. It works for toy problems, and it works in theory; there are even formal proofs that it will work given infinite time and resources. And yet in practice it often becomes unstable and collapses.

Supervised learning is one thing; having a model remain stable while bootstrapping through a complex environment is another. GPT is supervised learning, so far. Let's see if it can bootstrap.
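As a small illustration of the "works for toy problems" and off-policy points, here is a sketch of tabular Q-learning on a made-up five-state chain. The environment, the `step` function, and the hyperparameters are all assumptions chosen for the example. The behavior policy is uniformly random, yet the learned greedy policy still heads for the goal. This is the tabular case, with no function approximator, so it does not show the instability described above.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right


def step(state: int, action: int):
    """Hypothetical toy chain: reward 1.0 only on reaching the last state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes: int = 500, alpha: float = 0.5,
               gamma: float = 0.9, seed: int = 0):
    """Learn Q-values while acting with a purely random behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # off-policy: actions ignore q
            nxt, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Best action per non-terminal state according to the learned table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


print(greedy_policy(q_learning()))  # [1, 1, 1, 1]
```

The `target` line is the bootstrap: the estimate for one state is updated from the current estimate of the next state, which is exactly the step that can feed errors back into themselves once the table is replaced by a learned approximator.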

ftxbro 3 years ago |

> "We evaluate SELF-DEBUGGING on code-davinci-002 in the GPT-3 model family"

Putting aside the incongruity of Google researchers using the OpenAI model, I'm curious how GPT-4 would do in this situation. Probably its zero-shot attempts at coding would be better, and maybe its self-criticisms would be better too.

astrange 3 years ago | |

Google's recent LLM agent paper also used ChatGPT.

rhyme-boss 3 years ago |

A warning siren goes off in the background. Another step towards recursive self-improvement.

sowbug 3 years ago | |

So is this the singularity?

goatlover 3 years ago | | |

I doubt it without being able to evolve the weights, architectures and train those on new domains. Humans are still doing all of that. It's just a powerful tool people are using for a bunch of things. But it doesn't do anything on its own, and it doesn't care.

fatherzine 3 years ago | | |

Not yet. When the singularity bootstraps, it's going to be over fast.

alecco 3 years ago |

3 Google researchers using OpenAI GPT-3 code-davinci-002, interesting.

runlaszlorun 3 years ago |

> Self-Debugging with code explanation consistently improves the baseline by 2-3%

I’ll admit that I only have had time so far to read the abstract, and I’m not sure what their baseline is, but a 2-3% improvement doesn’t sound like a quantum leap forward that you’d expect from the title. Heck, I’d think that’s likely within expected sampling errors.
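The "within sampling error" hunch can be checked with a back-of-the-envelope calculation. The sketch below uses the standard error of a binomial proportion; the benchmark size (500 problems) and the pass rates (60% vs. 63%) are hypothetical numbers picked for illustration, not figures from the paper, and the function names are made up.

```python
import math


def proportion_std_error(p: float, n: int) -> float:
    """Standard error of a pass rate p measured on n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


def gain_in_std_errors(p_base: float, p_new: float, n: int) -> float:
    """Gain divided by the standard error of the difference (unpaired approximation)."""
    se_diff = math.sqrt(
        proportion_std_error(p_base, n) ** 2
        + proportion_std_error(p_new, n) ** 2
    )
    return (p_new - p_base) / se_diff


# Hypothetical numbers: 500 problems, baseline 60%, self-debugging 63%.
print(round(proportion_std_error(0.60, 500), 4))
print(round(gain_in_std_errors(0.60, 0.63, 500), 2))
```

Under these assumed numbers the 3-point gain comes out just under one standard error of the difference, so on its own it would not be distinguishable from noise. Two caveats: a larger benchmark shrinks the error, and since both methods run on the same problems, a paired comparison would be tighter than this unpaired approximation.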

I’m not sure about others’ experience and, while I keep reading articles showing impressive-seeming examples, my few forays into attempting to get ChatGPT to write code were actually completely useless. Even with follow-on prompts to correct itself.

The other day I asked it what covid case fatality rates were in 2020. After all the various opinions at the time, I was curious to see what it was pre-vaccine. It would alternately tell me that it couldn’t give me data for 2020 because it only had data up to Sep. 2021, and then give me wildly varying numbers.

Is this Roko’s Basilisk trying to lure me into a false sense of security… haha.

ChatGTP 3 years ago | |

…yes

cloudking 3 years ago |

GPT-4 in ChatGPT Plus can do this fairly well for coding tasks. I've had numerous cases where the code it produces has bugs initially. However, after a few rounds of passing the errors back in the chat it's usually able to correct its own code.

matisseverduyn 3 years ago |

Useful, but still wouldn't count on it.

With respect to GPT etc. as a copilot, the current dialogue seems to focus on "ask GPT to generate code to do X", then "just paste in the error message to fix bugs in the code GPT generates".

A.) Why is GPT generating code that results in simple compiler errors (that is why GPT probably shouldn't be used to generate any code / replace devs for real projects yet), and

B.) error messages are (just guessing here) probably <1% of the actual errors in most codebases.

I personally know of a few large companies laying off devs over this.

IMO, the tech debt we're going to see in 6 months will probably be huge. Now is a good time to start a staffing agency of human experts who can come in and fix this type of problem (extricating massive amounts of code generated by GPT without starting from scratch), because there will be a bunch of fires to put out and those fires will be worth $