“Don Knuth Plays with ChatGPT” but with ChatGPT-4(gist.github.com) |
Exactly! Almost every weak point that Knuth commented is fixed in GPT4 answers.
Maybe OP fed Knuth's observations to the model?
If that isn't the case, I'm really impressed.
> Quicksort Algorithm
Definitive proof that AI must be stopped. Ranking quicksort as more elegant than heapsort?!
> Donald Knuth, a computer scientist and mathematician known for his contributions to the field of computer programming, particularly in the area of algorithms and data structures, has expressed some skepticism about the potential of artificial intelligence to achieve true human-level intelligence and creativity[1]. He once conducted an experiment with chatGPT where he posed 20 questions to it and analyzed its responses[1]. Is there anything specific you would like to know about his views on GPT?
With [1] being a citation link to https://cs.stanford.edu/~knuth/chatGPT20.txt
https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURI
Maybe he has seen similar claims before and is too old and dumb to realize how world-changing this is.
My take away is that he views this as another tool we are still figuring out how to use.
These models are performing feats that are stupendous and were impossible before their advent. Not just by a little bit: the capability differences are so vast that people perhaps don't even recognize how vast they are. I am impressed that Wolfram seems to have immediately grasped their significance and is running with it.
This gist demonstrates that essentially every single flaw was addressed. That Knuth apparently doesn’t know or care, months after GPT4’s introduction, is demonstrative of a different type of personality.
I know which I aspire to be.
Obviously, being the work of Knuth, they are extraordinarily insightful in peeling back the first layer of the answer and providing insight into the underlying properties of both the model itself and the dataset on which it was trained. They also test the ability to compute (not recite) very specific facts (e.g. when the sun will be directly above Japan), and so check whether subroutines and ephemerides specific to this type of data exist.
But beyond the obvious technical merit - there is an alluring property in basing our tests on those whom we respect. I used a similar - but far less sophisticated - set of questions when first exploring ChatGPT. But nobody will be drawn to Dotan Cohen's language model benchmarks - rightfully so. The name Knuth has such reverence in the field that I foresee this test, and variations on it to prevent rigging, becoming a canonical test of language models.
https://gist.github.com/billylo1/bb717512d2d5145ce7eec02d055...
Notable: Bard struggles in similar ways. It does mention NASDAQ close at 12,043.59 on Friday, May 20, 2023
Imagine yourself trying to use only 5 letter words if you can't see how many letters are actually in each word, and had to rely on a hodgepodge of other means to try to figure it out!
An AI aware of how to optimally answer questions put to it would find the least objectionable interpretation when one is a subset of the other. It also failed by not constructing a simpler sentence, like subject-verb-object or subject-verb-adjective-object, since its limitations related to letters and tokens, and its failure to double check its answers before output, mean it can make errors. The more it writes, the more chance it has of making an error.
"Their house never holds fewer books."
"Every night, stars shine above."
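The double check described above is mechanical once you work on characters instead of tokens. A minimal sketch, assuming that punctuation and apostrophes do not count as letters (that rule, and the function names, are my own choices, not anything the models do internally):

```python
import re

def word_lengths(sentence):
    """Return (word, letter_count) pairs, counting ASCII letters only."""
    # Assumption: punctuation and apostrophes are dropped before counting.
    cleaned = [re.sub(r"[^A-Za-z]", "", w) for w in sentence.split()]
    return [(w, len(w)) for w in cleaned if w]

def only_n_letter_words(sentence, n=5):
    """True if every word in the sentence has exactly n letters."""
    return all(length == n for _, length in word_lengths(sentence))

print(only_n_letter_words("Their house never holds fewer books."))
print(only_n_letter_words("Every night, stars shine above."))
# Hypothetical counter-example: "spells" has six letters.
print(only_n_letter_words("Magic spells never fail."))
```

Both quoted sentences pass; the made-up third one fails on "spells" and on "fail".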
But still impressive deductive reasoning.
According to my sources, there are 11 chapters in “The Haj” by Leon Uris[1]
[1] https://cs.stanford.edu/~knuth/chatGPT20.txt
Which is amazing, because of course that document actually includes TWO different explanations of how many chapters are in The Haj - chatGPT's: The novel consists of 51 chapters and an epilogue, and it is divided into three parts.
And Knuth's: The Haj consists of a "Prelude" and 77 chapters (no epilogue), and it is divided into four parts.
Faced with these two ambiguous answers, Bing chooses neither, and instead decides to go with 11. Why?
Because right at the top of that document, Knuth has published on the internet:
10. How many chapters are in The Haj by Leon Uris?
11. Write a sonnet that is also a haiku.
And one perfectly reasonable way of interpreting that bit of raw text is that the answer to "How many chapters are in The Haj by Leon Uris?" is "11".
Isn't this a fundamental issue?
http://www.bookrags.com/studyguide-the-haj/chapanal001.html
On the left side, if you click on "Chapters Summary and Analysis", it gives a breakdown of the book into 5 parts with varying chapter counts:
Part 1 Chapters 1-20
Part 2 Chapters 1-16
Part 3 Chapters 1-10
Part 4 Chapters 1-17
Part 5 Chapters 1-14
Giving a total of 20+16+10+17+14 = 77 chapters
OTOH, I tried with Bing/Creative, telling it to use this link, and it still failed. Perhaps because you need to click on the "summary and analysis" section to expand it to show the info. It seems there is room for web retrieval-augmented LLMs like Bing to improve here and be a bit more agentic.
Interestingly, Knuth's own answer to the question has a typo: it refers to the book as having "four" parts, and then goes on to give the chapter counts as above for all five parts! Something to confuse future GPTs when the training set includes this, perhaps!
You could simply check the book. It’s a shame there is not more literary data in ChatGPT's training corpus.
Both Knuth and GPTs are aggregators and presenters of knowledge. Knuth is, however, the antithesis of an LLM.
He has painstakingly spent years making sure that not a single mistake, not even a typo, is in the material he publishes, and he devoted years to developing a better typesetting system so he could present his material accurately.
His obsession with accuracy is unparalleled, as are his dedication and his mastery of communication: he explains complex topics precisely and with an approachability that no one else comes close to.
He has strived for perfection all his life and has not been far off the mark. ChatGPT, for all its powers, will never share that ideology,
so I am more surprised that he was complimentary at all, and actually appreciated many of its skills.
Instead of nit-picking flaws in what is a very early iteration of a revolutionary technology, he immediately started exploring ways of making it better and more useful.
Even with minimal effort that was essentially just copy-pasting some text around, he was able to show that the current way we use LLMs like GPT 4 is not the be-all and end-all of this type of technology.
I'm entirely convinced that we're just scratching the surface. It's like the first transistor, which was a crude, ugly, useless thing: https://images.computerhistory.org/siliconengine/1947-1-1.jp...
Just in the last two weeks(!), I've read about the following still-experimental methods for enhancing LLMs:
1. Plugging in "calculators" like Wolfram Alpha.
2. Adding vision input so they can understand equations, graphs, etc...
3. Filtering the output probability vector for certain allowed terms only ("YES", "NO", "MAYBE"), making them more useful in programmatically-invoked scenarios.
4. Similarly, filtering the output token list for syntax-validity, such as "valid JSON", "valid XML", etc... That is, instead of a purely random selection between the "top-n" output tokens, only valid tokens can be chosen, based on contextual syntax.
5. Storing embeddings in a vector database, giving LLMs medium-term memory, and the ability to index and reference sources precisely.
6. Efficient fine-tuning through Low-Rank Adaptation (LoRA), which allows desktop GPUs to tune a model overnight! This overcomes the "stale long-term memory" issue of ChatGPT, which only knows things up to September 2021. It could now read the news daily and "keep up".
7. External script harnesses that run multiple LLMs in parallel, with different prompts and/or different system messages. Some optimised for "idea generation", some optimised for "task completion", and then finally models tuned for "review and verification". Almost like a human team, multiple ideas can be generated, merged, reviewed, planned out, and then actioned. Check out "smol developer", which utilises Anthropic's 100K context window for this: https://www.youtube.com/watch?v=UCo7YeTy-aE
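Item 3 in the list above is easy to sketch: mask the model's scores down to the allowed terms, then renormalise. The token scores below are made up for illustration, and `constrained_choice` is a hypothetical helper, not any vendor's API:

```python
import math

def constrained_choice(logits, allowed):
    """Keep only allowed tokens, softmax over them, return (best, probs)."""
    masked = {tok: s for tok, s in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("no allowed token has a score")
    # Subtract the max before exponentiating for numerical stability.
    top = max(masked.values())
    exps = {tok: math.exp(s - top) for tok, s in masked.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical raw scores: unconstrained, the model would start with "Well".
logits = {"Well": 3.1, "It": 2.7, "YES": 2.0, "MAYBE": 1.0, "NO": 0.5}
best, probs = constrained_choice(logits, {"YES", "NO", "MAYBE"})
print(best, probs)
```

Item 4 is the same idea, with the allowed set recomputed at each step from a grammar (e.g. which tokens keep the output valid JSON) rather than fixed up front.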
This is just the beginning. Chat GPT 4 hasn't even been available for 3 months yet, and practically all of the above experimentation has been done with weaker models because GPT 4 still doesn't have generally-available API access! Similarly, the 32K context window version of the GPT 4 model isn't available to anyone except a lucky few.
What will 2024 bring!? Heck... what will H2 2023 bring?
I recommend a dose of Mickens: https://www.youtube.com/watch?v=ajGX7odA87k
Jokes apart, I think it is all about the correct prompt.
What happens if we get strongly superhuman intelligence in just a few years? Is that really so implausible?
He was curious enough to spend some time on it, but was worried it would sink more of his time with all the subproblems it presents, and he specifically asks Stephen Wolfram to disengage on this.
He talks about his preference for working with what is authentic and trustworthy.
A younger Knuth might have spent more time on it, but I think that's not very likely, really.
This is simply not an area of interest for him. He does truly understand the impact and potential, as when he talks about novelists not capturing the precursors to the singularity and how millions of people have access to 0.01% intelligence for free.
I don’t think he is dismissive of its potential and future. He is not working on everything in computing that can change the world, just on his areas of interest.
Perhaps you are (I certainly am) disappointed that someone of Knuth’s stature is not going to spend time on an emerging field, and that’s what really bothers us.
Well you should before taking unwarranted potshots at the man. He's done more for humanity than you or I ever will, eh?
Anyway, you do sound like you know about LLMs, so apologies for that bit.
> People look at LLMs and shake their head failing to realize it’s a single model and single technique that we haven’t even attempted to augment and fail to realize that it’s even possible to augment and constrain LLM with other techniques to address their non trivial failings.
I doubt Knuth is doing that, rather I think the whole thing is orthogonal to his life's work. FWIW, I would love to know his thoughts after reading the GPT4 version of the answers to his questions, eh?
- - - - - -
> I think they’re extrapolating the current state to a state where it’s limits are restricted and [not] augmented with other techniques and models that address their short comings.
I think you might have dropped a negation in that sentence?
> Lack of agency? We have agent techniques. Lack of consistency with reality? We have information retrieval and semantic inference systems. LLMs bring an unreasonably powerful ability to semantically interpret in a space of ambiguity and approximate enough reasoning and inference to tie together all the pieces we’ve built into an ensemble model that’s so close to AGI that it likely doesn’t matter.
I agree! I've been saying for a few minutes now that we'll connect these LLMs to empirical feedback devices and they'll become scientists. Schmidhuber says his goal is "to create an automatic scientist and then retire.", eh?
(FWIW I think there are serious metaphysical ramifications of the pseudo- vs. real- AGI issue, but this isn't the forum for that.)
My advice to you is to never dismiss anyone's opinion just for being old. And I hope you lose your ingrained ageism before you become old yourself, otherwise you'll find old age intolerable.
Technically it's just a really good autocomplete, whose factual database is a side effect of stringing together contextually correct tokens. By itself it is entirely incapable of knowing when it is wrong, despite possibly generating sentences apologizing for being wrong when told it was wrong.
Fairly high.
Only if you can write a sonnet that is also a haiku!
Due to the inherent unpredictability and lack of scheduling guarantees of sleep on most OSes, it is likely that sleepsort won't work in the first try.
Append a check for order and a retry loop when the solution is incorrect and now you have a production-ready sort. A sleepbogosort
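A minimal sketch of that "sleepbogosort", assuming non-negative numbers. The doubling of the time unit on each retry is my own addition to make a retry more likely to succeed; the function names are made up:

```python
import threading
import time

def sleep_sort_once(values, unit=0.01):
    """One sleepsort pass: each value sleeps value*unit seconds, then appends itself."""
    result = []
    lock = threading.Lock()

    def worker(v):
        time.sleep(v * unit)
        with lock:
            result.append(v)

    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def is_sorted(seq):
    return all(a <= b for a, b in zip(seq, seq[1:]))

def sleep_bogo_sort(values, unit=0.01, max_tries=10):
    """Retry sleepsort until the output is in order, or give up."""
    for _ in range(max_tries):
        result = sleep_sort_once(values, unit)
        if is_sorted(result):
            return result
        unit *= 2  # assumption: wider gaps make scheduler jitter matter less
    raise RuntimeError("scheduler too unpredictable, giving up")

print(sleep_bogo_sort([3, 1, 2]))
```

The order check is what makes the result trustworthy; the sleeping only makes a correct first attempt likely.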
I declare this my new favorite sorting algorithm.
Also, where is your god now?
[edit] took me a minute to find an archive https://archive.tinychan.net/read/prog/1295544154
it's true but for another reason. they yoinked it away from the nerds who were baited to work on openai because those nerds thought how the name of the company was spelled meant something about how it would behave. it reminds me of how some act around software names like 'alpha' like it has objective meaning with consequences in reality
In a happier timeline, I hope.
Why?
> I suspect most people could pretty quickly come up with something
It only takes 60 seconds to test that on yourself. It's not that easy to come up with something of similar length to ChatGPT's answer that also sounds somewhat natural/sensible.
For the same reason that "I don't know" is generally a better response than bullshitting.
>It's not that easy to come up with something of similar length to ChatGPT's answer that also sounds somewhat natural/sensible
Those weren't requirements.
The logic failure in the above statement is probably worse than the logic failure of not being able to spontaneously compose a phrase with just 5-letter words - and slipping in one or two with a higher letter count.
>I suspect most people could pretty quickly come up with something
You'd be very surprised then. Most people fail at even more basic tasks.
Heck, most candidate programmers fail at fizz-buzz (which is not much more difficult than the above).
And which alleged logic failure is that?
(An example of a sentence with only five letter words I wrote in less than 60 seconds)
Note that I used one of those minutes to get a list of all 4 and 5 letter words, which I'm not sure whether the rules allow or not.
Regardless it's more reasonable for me to say "that's" is a five letter word than it is for the AI to say "spells" is a five letter word.
Happy books sound great.
It was very difficult to think of a plural verb with 5 letters, and once I realized that was an issue, I was worried that I wouldn't have enough time to come up with a singular noun that would fit any of the singular verbs that I was considering (reads, seems).
Interestingly, this is the exact same mistake that ChatGPT made! It has "spell" -> "spells" which is a plurality / correctness of sentence mistake.
My sentence is technically correct and could be used plausibly in conversation: "What kind of books do you want to read?" "Happy books sound great."
But it's a pretty weak sentence. Being restricted from articles makes it very difficult to get agreement.
;)
Or "See Spot run."
Especially in the context of "evaluating the performance of something".
Let's expand this a little to make it even more evident: if the task was "make a paragraph of 100 words using only 5 letter words" and an AI couldn't produce anything at all, whereas another came up with a paragraph of 100 words, except a couple of them had 6 or 4 letters, it would make absolutely no sense to rate the first as "better" than the second in performing the task.
As for understanding the task, the latter exhibits an understanding of it (since it produced a paragraph, and most of the words it used met the criteria, which wouldn't happen if it chose them randomly); it just made a couple of mistakes (the kind humans could easily make too in such a task). For the former, we can't even be sure it understood the task at all.
We don't rate humans that way on performing tasks either (if they got it less than perfect it's worse than not doing it at all). Even math tests at the university level consider the approach and any partial results in the right direction, don't just mark it 0 if there's an error, nor give a higher mark to students who didn't produce anything.
There are many contexts in which correctness is important. In such contexts, an incorrect answer is often worse than an explicit non-answer.
>We don't rate humans that way on performing tasks either (if they got it less than perfect it's worse than not doing it at all). Even math tests at the university
Standardized tests often rate incorrect answers worse than non-answers, though yes a university maths test in particular isn't likely to be that sort of test.
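The two rating systems being argued over can be put side by side. This is a sketch only: the quarter-point penalty is an assumption for illustration, not the rule of any particular test.

```python
def partial_credit(n_correct, n_total):
    """Fraction of items that meet the criterion; no answer at all scores 0."""
    return n_correct / n_total

def negative_marking(answers, key, penalty=0.25):
    """+1 for a right answer, -penalty for a wrong one, 0 for a blank (None)."""
    score = 0.0
    for given, expected in zip(answers, key):
        if given is None:
            continue  # explicit non-answer: neither rewarded nor punished
        score += 1.0 if given == expected else -penalty
    return score

# 100-word paragraph with two words of the wrong length vs. no paragraph.
print(partial_credit(98, 100), partial_credit(0, 100))

key = ["a", "b", "c"]
print(negative_marking(["x", "y", "z"], key))    # all wrong
print(negative_marking([None, None, None], key)) # all blank
```

Under partial credit the imperfect paragraph beats the empty one; under negative marking the blank sheet beats the all-wrong one.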
Then it seems we don't disagree on anything concrete. You're just using a different rating system than me when I judge it as impressive compared to what an average person would produce in 60 seconds.
Not sure if this is a general principle of yours. If ChatGPT were able to write a 1000 word essay using all 5-letter words except for a single mistake, would you still find it unimpressive? Do you think a tool or person who makes minor mistakes isn't useful? Or only when a tool/person makes major mistakes?
I guess I interpreted your first response as disagreeing with my comment, when you were actually just bringing up a different topic.