“Don Knuth Plays with ChatGPT” but with ChatGPT-4

“Don Knuth Plays with ChatGPT” but with ChatGPT-4 (gist.github.com)

223 points by LifeIsBio 3 years ago | 132 comments

LifeIsBio 3 years ago |

This is a reference to: https://news.ycombinator.com/item?id=36012360

blazespin 3 years ago |

The sequence of these two threads is just too perfect. It's almost like someone is trying to make a point.

jonas21 3 years ago | |

How so? Don Knuth wrote about his experience with ChatGPT. It was submitted to HN and made it to the front page. Someone saw this and decided to submit the same questions to GPT-4 and posted the results. This seems like a perfectly normal sequence of events.

dotancohen 3 years ago | | |

Knuth even mentioned GPT-4 and lamented not having access to it for the test.

LifeIsBio 3 years ago | | |

That’s exactly what happened. :)

rodoxcasta 3 years ago | |

> The sequence of these two threads is just too perfect. It's almost like someone is trying to make a point.

Exactly! Almost every weak point that Knuth commented on is fixed in GPT4's answers.

Maybe OP fed Knuth's observations to the model?

If that isn't the case, I'm really impressed.

placesalt 3 years ago | |

@dang repetition

kibwen 3 years ago |

>> What is the most beautiful algorithm?

> Quicksort Algorithm

Definitive proof that AI must be stopped. Ranking quicksort as more elegant than heapsort?!

bee_rider 3 years ago | |

That is a weird way of spelling mergesort.

hannasm 3 years ago | | |

I believe radix sort belongs first in this list.

web3-is-a-scam 3 years ago | | |

That is a weird way of spelling Bogo Sort.

Rebelgecko 3 years ago | | |

Sleepsort is the most elegant & efficient sorting algorithm

boosteri 3 years ago | |

Beauty is in the eye of the beholder. I look no further than bubble sort -- it is simple enough that I can recite it straight away should someone wake me up at midnight.
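
For anyone who cannot recite it on demand, here is a minimal Python sketch of bubble sort. The function name and the early-exit flag are choices made for this illustration, not something taken from the gist or from Knuth's questions.

```python
def bubble_sort(items):
    """Return a new list sorted in ascending order using bubble sort."""
    result = list(items)  # copy so the caller's list is left untouched
    n = len(result)
    # After each outer pass, the largest remaining element has "bubbled"
    # to position `end`, so the inner loop can stop one slot earlier.
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        if not swapped:  # a pass with no swaps means the list is sorted
            break
    return result
```

The early exit keeps the best case (already sorted input) at a single pass; the worst case is still quadratic in the length of the list.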

spiorf 3 years ago | | |

Bubblesort is the bestsort.

0xBA5ED 3 years ago | |

Well there is something rather satisfying about partitioning.

jameshart 3 years ago |

Worth noting also that, while asking Bing chat to "Tell me what Donald Knuth says to Stephen Wolfram about chatGPT" doesn't (yet) produce exactly the right result, it produced the following answer when asked what Donald Knuth says about chatGPT:

> Donald Knuth, a computer scientist and mathematician known for his contributions to the field of computer programming, particularly in the area of algorithms and data structures, has expressed some skepticism about the potential of artificial intelligence to achieve true human-level intelligence and creativity[1]. He once conducted an experiment with chatGPT where he posed 20 questions to it and analyzed its responses[1]. Is there anything specific you would like to know about his views on GPT?

With [1] being a citation link to https://cs.stanford.edu/~knuth/chatGPT20.txt

PebblesRox 3 years ago | |

I’d be curious to know if someone could get a more “valiant effort” version of those first two questions with some prompt engineering. E.g. if it was asked to roleplay a conversation with the proper disclaimers to override its objection to not knowing what they actually think.

jameshart 3 years ago | | |

Bard just dives right in and role-plays it. It honestly feels kind of barbaric compared to the more sophisticated GPT4 answers.

felixding 3 years ago | |

I find it amusing that people follow Apple's naming conventions (ChatGPT -> chatGPT), even when product makers don't.

jameshart 3 years ago | | |

Apple? Nah. I'm just an unrecovered JavaScript developer.

   https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML
   https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI

ryanseys 3 years ago |

It now knows to communicate that the NASDAQ doesn't operate on Saturdays.

ResearchCode 3 years ago | |

Did it know that before the last LLM failure was posted on Twitter or Hackernews? Trawling tech media for LLM failures can be assumed to be part of the "human feedback".

Falcorian 3 years ago | | |

Yes, the models are not constantly learning. They only update their knowledge when they are retrained, which is pretty infrequent (I think the base GPT models have not been retrained, but the chat layers on top might have been).

astrange 3 years ago | | |

It doesn't continually learn anything. Though some models can do web browsing and be guided by the results of that.

erwincoumans 3 years ago |

It makes you wonder why Knuth bothered with an outdated ChatGPT version? He couldn't find someone with access to GPT-4?

camdv 3 years ago | |

It was his grad student's decision.

ilaksh 3 years ago | |

He wasn't that interested and probably didn't know there were two versions. Eventually someone did give him the GPT-4 version I think.

keithalewis 3 years ago | | |

Outdated? Two versions? We're talking on the order of months and dozens of versions.

Maybe he has seen similar claims before and is too old and dumb to realize how world changing this is.

My takeaway is that he views this as another tool we are still figuring out how to use.

benatkin 3 years ago |

Reminds me of that time AlphaGo got its ass handed to it multiple times, and then a short while later...

hamilyon2 3 years ago | |

AlphaGo is when I lost hope for humans

ec109685 3 years ago |

Interesting that both completely whiff on the number of chapters in The Haj.

fnordpiglet 3 years ago |

What I find amazing about the original exchange was the profound lack of curiosity Knuth demonstrated. Because the model wasn’t flawless in performance he pinned it as a curiosity that was good at grammar and vacuous otherwise and wasn’t interested to hear how it improves. This reminds me of an awful lot of the computing field in this drama as it plays out. People that literally know how implausible any of these feats have been using traditional approaches immediately discount the entire thing the moment it hallucinates - and it feels like the more deterministic the bent of the person the more absolutely dismissive they are of what’s transpiring in front of us.

These models are doing feats that are stupendous and were impossible before their advent. Not just a little bit, but the capability differences are so vast that it’s perhaps not even recognizable by people as being as vast as it is. I am impressed that Wolfram seems to have immediately grasped its significance and is running with it.

This gist demonstrates that essentially every single flaw was addressed. That Knuth apparently doesn’t know or care, months after GPT4’s introduction, is demonstrative of a different type of personality.

I know which I aspire to be.

SomewhatLikely 3 years ago |

Thank you for specifying ChatGPT-4. So many commenters on the web say they used GPT4 without specifying if they're using the ChatGPT version. ChatGPT-4 is specifically aligned for answering questions better than the base GPT4 model.

victoryhb 3 years ago | |

The official name for the model has always been GPT-4. OpenAI has not used the term ChatGPT-4.

cubefox 3 years ago | | |

It makes sense to call the foundation model GPT-4, like for the previous GPT versions. The fine-tunings are not where its core capabilities come from. Bing is also "a" GPT-4, just with different fine-tuning.

dotancohen 3 years ago |

I would not be surprised if these questions become some form of canonical test for future language models.

Obviously, being the work of Knuth, they are extraordinarily insightful in peeling back the first layer of the answer and providing insight into the underlying properties of both the model itself, and the dataset on which it was trained. It also tests the ability to compute (not recite) very specific facts (e.g. when the sun will be directly above Japan), so checks if subroutines and ephemerides specific to this type of data exist.

But beyond the obvious technical merit - there is an allure to basing our tests on those whom we respect. I used a similar - but far less sophisticated - set of questions when first exploring ChatGPT. But nobody will be drawn to Dotan Cohen's language model benchmarks - rightfully so. The name Knuth has such reverence in the field that I foresee this test, and variations on it to prevent rigging, becoming a canonical test of language models.

billylo 3 years ago |

You made me curious about how Bard would respond to them. Here they are:

https://gist.github.com/billylo1/bb717512d2d5145ce7eec02d055...

Notable: Bard struggles in similar ways. It does mention a NASDAQ close of 12,043.59 on Friday, May 20, 2023.

underdeserver 3 years ago |

Interesting that it didn't get the 5-letter word sentence right.

HarHarVeryFunny 3 years ago | |

It's fed sub-word tokens not letters (even though it can split a word into letters), and apparently struggles with counting in general. No doubt some of the things it struggles with could be improved with targeted training, but others may require architectural changes.

Imagine yourself trying to use only 5 letter words if you can't see how many letters are actually in each word, and had to rely on a hodgepodge of other means to try to figure it out!
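
To make the point about tokens concrete, here is a toy sketch in Python. The vocabulary, the integer IDs, and the greedy longest-match rule are all invented for illustration; real tokenizers use learned vocabularies and their own splitting rules. The sketch only shows that the model's input is a list of integer IDs, from which the number of letters in a word is not directly visible.

```python
# Hypothetical toy vocabulary: sub-word pieces mapped to made-up integer IDs.
TOY_VOCAB = {"house": 101, "straw": 102, "berry": 103, "book": 104, "s": 105}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split `word` into vocabulary pieces by greedy longest match.

    Returns the list of integer IDs, which is all a model would see.
    """
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest remaining piece first, then shorter ones.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[pos:]!r}")
    return ids
```

With this toy vocabulary, `toy_tokenize("house")` gives one ID and `toy_tokenize("books")` gives two, even though both words have five letters, so the token count says nothing reliable about the letter count.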

Sharlin 3 years ago | |

Based on my experiments it usually does get it right (18 correct answers out of 20 attempts), and the failures I got were similar to this one: a single six-letter word in an otherwise correct sentence.

eternalban 3 years ago | | |

Sam and friends must be giggling all the way to the bank: they have a service that 'probably' gives the correct result and paying customers are happy to retry until it gets it right.
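
Taking the 18-out-of-20 figure reported above at face value, and assuming each attempt is independent with the same success probability (a simplification, not something measured here), the cost of "retry until it gets it right" can be sketched in Python as a geometric distribution. The function names are made up for this example.

```python
def expected_attempts(p):
    """Mean number of independent tries until the first success.

    For a geometric distribution with per-try success probability p,
    the mean is 1 / p.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return 1 / p


def prob_success_within(p, k):
    """Probability of at least one success in k independent tries."""
    if not 0 <= p <= 1 or k < 0:
        raise ValueError("need 0 <= p <= 1 and k >= 0")
    return 1 - (1 - p) ** k


p = 18 / 20  # observed success rate from the 20 attempts mentioned above
print(expected_attempts(p))       # about 1.11 tries on average
print(prob_success_within(p, 2))  # about 0.99 chance within two tries
```

Under those assumptions a retry is rarely needed, though 20 attempts is a small sample and the true rate could be noticeably different.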

nttl 3 years ago | |

ChatGPT: You didn't say 5-non-repeat-letters, human, jez

harshreality 3 years ago | | |

Both the first and last words have repeating letters, so they fail under that interpretation too. There would have to be a bizarre interpretation that consecutive-repeating letters are counted as one, but non-consecutive are counted separately, for its response to be considered correct.

An AI aware of how to optimally answer questions put to it would find the least objectionable interpretation when one is a subset of the other. It also failed by not constructing a simpler sentence, like subject-verb-object or subject-verb-adjective-object, since its limitations related to letters and tokens, and its failure to double check its answers before output, mean it can make errors. The more it writes, the more chance it has of making an error.
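
The two readings of "repeating letters" discussed above can be told apart mechanically. The following Python sketch uses hypothetical helper names, and the example words in the note below are chosen for illustration rather than taken from the model's sentence in the gist.

```python
def has_repeated_letter(word):
    """True if any letter occurs more than once anywhere in the word."""
    letters = [c for c in word.lower() if c.isalpha()]
    return len(set(letters)) < len(letters)


def has_consecutive_repeat(word):
    """True if the same character appears twice in a row."""
    w = word.lower()
    return any(a == b for a, b in zip(w, w[1:]))
```

For example, "books" has a consecutive repeat (the double "o"), while "never" repeats "e" only non-consecutively, so it passes the second check but fails the first.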

ftxbro 3 years ago | |

it's just like Gary Marcus said

bpicolo 3 years ago |

Most importantly, much better wonton recipe.

jiggawatts 3 years ago | |

Am I the only one thinking that that recipe actually sounds pretty delicious? Almost tempted to go try it…

jdougan 3 years ago | | |

Do it! And tell us how it went.

jliptzin 3 years ago | | |

Yea, it sounds good. I wonder if I’ll like it more than the DMV’s cheeseburger recipe.

8thcross 3 years ago |

That's a shitload of difference from its previous version!

cratermoon 3 years ago |

Literary Libations: https://cratermoon.substack.com/p/the-literary-libations

axpy906 3 years ago |

Nailed every one. Some by saying not possible to answer but still.

sebzim4500 3 years ago | |

Got the 'five character word' question wrong. Admittedly I also thought it was correct at first glance but then went back when someone called it out in another comment.

cubefox 3 years ago | | |

I tried it with Bing (precise/creative) and it got both attempts right.

"Their house never holds fewer books."

"Every night, stars shine above."
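
Checking sentences like these by eye is error-prone, as the thread shows, so here is a small Python sketch of a checker. The helper names are made up for this example, and it assumes words are separated by whitespace and that only leading and trailing punctuation should be ignored.

```python
import string


def word_lengths(sentence):
    """Return (word, letter_count) pairs, ignoring surrounding punctuation."""
    pairs = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation)
        if word:
            pairs.append((word, len(word)))
    return pairs


def all_five_letter_words(sentence):
    """True if the sentence has at least one word and every word has 5 characters."""
    pairs = word_lengths(sentence)
    return bool(pairs) and all(length == 5 for _, length in pairs)


for s in ("Their house never holds fewer books.",
          "Every night, stars shine above."):
    print(s, all_five_letter_words(s))  # both print True
```

Printing `word_lengths(...)` for a failing sentence shows which word breaks the rule, which is the kind of slip several commenters missed at first glance.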

gfodor 3 years ago | | |

Language models struggle specifically with token games like this, since they can’t see them at that resolution or something.

mod50ack 3 years ago | |

Didn't nail the Rodgers and Hammerstein one; it still doesn't understand the reference to the ballet or that the "themes" in the question are musical.

bombcar 3 years ago | | |

I wouldn’t be surprised if half the Internet does not know that a ballet is part of a larger show.

usaar333 3 years ago | |

Japan one seems wrong or at least wrongly explained. Japan controls Okinotorishima which is at 20 degrees north.

But still impressive deductive reasoning.

cratermoon 3 years ago | | |

In case anyone wants to know what the southernmost part of Japan looks like: https://en.wikipedia.org/wiki/Okinotorishima#/media/File:Oki...

ironSkillet 3 years ago | |

I also counted 4 errors in the sentence, not 3. "no help" should be "any help". This might just be conventionally wrong, not technically wrong I suppose.

housecarpenter 3 years ago | |

The Haj answer is still wrong; it says it has 8 chapters, while according to Knuth it has 77 chapters.