ChatGPT as a Calculator for Words(simonwillison.net) |
ChatGPT as a Calculator for Words(simonwillison.net) |
I find that GPT's answers are for the most part more reliable the searches, specifically today's searches. In the last 12 months, search results have become so spammy with AI generated pages (oh the irony), that it's hard to find reliable answers.
So like search, I look at GPT's answers with a grain of salt and validate them, but these days I use GPT all day every day and search rarely. To be fair, I use it a lot because I have a GPT CLI that works just the way I want it to, since I wrote it :-). https://github.com/drorm/gish
It seems like you've been using similar workflows to what I've been trying for coding with gpt?
https://github.com/paul-gauthier/easy-chat#created-by-chatgp...
Also, I wonder how they decide what code is worth training on. Because a lot of code is written in poor style/has technical debt, it might be the case that these LLMs in the long run lead to an increase in the technical debt in our society. Plus, eventually, and this might already be happening, the LLM are going to end up training on their own outputs, so that could lead to self immolation by the model. I am not certain RLHF completely resolves this issue.
Rebecca Jarvis interviews Sam Altman for ABC News Rebecca Jarvis, https://www.youtube.com/watch?v=540vzMlf-54
(I don't think this contradicts what you said.)
Quoting what he says [0][1]:
> You know, a funny thing about the way we're training these models is I suspect too much of the like processing power for lack of a better word is going into using the models as a database instead of using the model as a reasoning engine. The thing that's really amazing about the system is that it, for some definition of reasoning, and we could of course quibble about it and there's plenty for which definitions this wouldn't be accurate. But for some definition it can do some kind of reasoning. And, you know, maybe like the scholars and the experts and like the armchair quarterbacks on Twitter would say, no, it can't. You're misusing the word, you know, whatever, whatever. But I think most people who have used the system would say, okay, it's doing something in this direction. And I think that's remarkable. And the thing that's most exciting and somehow out of ingesting human knowledge, it's coming up with this reasoning capability. However, we're gonna talk about that. Now, in some senses, I think that will be additive to human wisdom.
[0] https://steno.ai/lex-fridman-podcast-10/367-sam-altman-opena...
Google, in comparison, returned absolutely irrelevant SEO spam.
Sometimes search means “I can sort of describe what I’m looking for, can you tell me what it’s called?”. LLMs excel here. I told GPT4 I’m doing computer animation and want to do smooth blending, it told me that’s called “interpolation”, I asked for some common terms in the literature about this to help me look and it told me about LERP, SLERP, quaternions, splines, Beziers, keyframes, inverse kinematics, and motion capture. All useful jumping-off points. (A subset of this type of search is “I know what this is called, can you tell me more about it?”. This is probably the place where LLMs sell snake oil the most; they always provide a convincing explanation of the thing, but there’s no guarantee on veracity.)
Other times, search means “I have a specific phrase and I want to find occurrences of it”. LLMs aren’t just bad at this, they are constitutionally incapable of it. The way you build an LLM necessarily involves taking all specific phrases and occurrences thereof, and blending them up into a word slurry that is then condensed and abstracted into floating point weights. It no longer has the specifics to give you. It’s a shame that search engines have let this task (“ctrl-f the web”) fall by the wayside. It’s probably a large part of why people think Google search sucks now, it certainly is for me. (There’s this one essay about the Harappan civilization that I used to be able to find by searching for “strange builders mist of time”, I definitively remember that exact phrase working for me many years ago, and now it does not work and I cannot find that essay anymore.)
I agree: I do use it as a search engine myself for a bunch of things, but those tend to be things where I've developed a strong intuition that it's likely to give me a reasonable result.
People who haven't developed that intuition yet tend to run into problems - and will often then loudly proclaim that LLMs are evidently useless and shouldn't be trusted for anything.
Sure things in Wikipedia or official documents could be accurate, but the internet is still full of misinformation
This is more than just a fancy zip file of Markov sequences. Someone has got to put a stop to this silly line of reasoning, I'm not sure why more people familiar with the math of deep learning aren't doing their best to dispel this particular belief (which people will then use as the foundation for other arguments, and so on, and so on, and this is how misconceptions somehow become canon in the larger body of work).
I know the basics of deep learning and I found the article accurate.
I.e. one can think of it as a NERF of an underlying manifold instead of just assembling pictures taken of the manifold, which is an important distinction to make.
I.e. it learns the manifold, not the manifold samples. That's what makes it so powerful and lets it coherently mix and match very abstract concepts together. Even if it gets it wrong, one could link that to the fuzziness of a NERF where there is not as much data.
That's why this whole "average" business is silly nonsense. We're reducing the empirical risk over the dataset, not the L2 loss over it for Pete's sake.
I guess “hallucinate” stuck because it works across all disciplines: text, audio, vision…
That's because there is no way for the model to take the internet and separate fact from fiction or truth from falsehood. So it should not even try to, unless it can somehow weigh options (or preform its own experiments). And that doesn't mean counting occurrences, it means figuring out a coherent worldview and using it as a prior to interpret information, and then still acknowledging that it could be wrong.
You can get deterministic output (on a given machine) by setting temperature=0. The Chatgpt interface doesn't let you do that, but the playground API does.
More to the point, I don't think a "calculator for words" should be deterministic. Operating on language is much more subjective than operating on numbers. If anything, this is a human limitation that we expect only one answer to one question. I'm a contrarian to Chomsky's philosophy, as he's always been pessimistic of statistical language processing and often approaches from the more objective-side like grammar and parsing.
I'm waiting for the point where we can tap knowledge from Deep Learning models to build rule-sets that appease the deterministic crowd (and get the insight of what an LLM is really modeling). A breakthrough here could also help with two big problems a) alignment and b) copyright.
My pet theory is that editors aren't as good as they used to be. Market pressure to publish faster and faster in a vain attempt to keep up with the internet means that fewer of them are given the time and support to get really skilled. Thus resulting in ham fisted edits that jar me out of reading flow, and thence to analysing why.
(This pressure operates the other way too. Many authors' works are pushed out the door when they should have had more editing. )
With a calculator this is a feature. We want computations to be the same after all. Everyone should be able to get the same results when they enter the same numbers in. But this homogeneity doesn't belong in writing.
The hints are not calculated from the input, they're from the training set.
For example you can copy paste a page describing API documentation and ask an LLM to not only make an API call but then also interpret results. This is the most fascinating use of LLMs to me so far.
My go-to explanation is to think of ChatGPT like a really intelligent friend who's always available to help you out – but they're also super autistic, and you need to learn the best way to interact with them over time.
If it has the same seed, why would you get a different reply?
My favorite is Chalmer's engine bit:
> What pops out of self-supervised predictive training is noticeably not a classical agent. Shortly after GPT-3’s release, David Chalmers lucidly observed that the policy’s relation to agents is like that of a “chameleon” or “engine”:
>> GPT-3 does not look much like an agent. It does not seem to have goals or preferences beyond completing text, for example. It is more like a chameleon that can take the shape of many different agents. Or perhaps it is an engine that can be used under the hood to drive many agents. But it is then perhaps these systems that we should assess for agency, consciousness, and so on.6
Also, whatever loose rules it has are more literary than mathematical. Plot twists often work.
No, it is clearly not, and that is a very easily testable hypothesis.
Thank you for sharing.
In the analogy of the essay, your argument would be like saying that reality cannot be simply the application of quantum physics, because you are allowed to make new rules like Calvinball within reality which are different from the rules of quantum physics.
How can it explain them effectively if it's just a "stochastic parrot" or a "blurry jpeg"? Neither fits what it does.
Don't such effective explanations require some sort of understanding?
Below is an example explanation, and there are more here: [1]
---
Explain "In software systems it is often the early bird that makes the worm."
This statement refers to the importance of early planning and implementation in software systems. In software development, the early stages of the project, such as requirements gathering, design, and implementation, are crucial for the success of the project.
If the early stages of the project are done well, it can set the foundation for the rest of the development process, making it easier to build and maintain the software system. On the other hand, if these stages are not done well, it can lead to delays, rework, and budget overruns.
The phrase "the early bird that makes the worm" is a reference to the proverb "the early bird catches the worm", which means that those who act early and decisively will have an advantage over others.
Similarly, in software development, if the development team acts early and puts effort into the early stages of the project, they can gain an advantage over competitors, reducing the risk of project failure, and increasing the chances of success.
In summary, the statement "In software systems, it is often the early bird that makes the worm" emphasizes the importance of early planning and implementation in software development, which can set the foundation for the rest of the development process and increase the chances of success.
[1] - https://diaphora.blogspot.com/2023/04/chatgpt-explains-epigr...
It doesn't even match the basic math of the loss function, and implies a static snapshot instead of a decomposed, dynamic system that uses disentangled components to form a solution (whether incorrectly or correctly).
I.e. I feel it really downplays the beauty of what is happening, and that is something frustrating to me, especially when it's fairly straightforward mathematically that that is not at all the case of what's happening, at least from my personal experience/perspective.
This. The value proposition is very clearly tied to the quality of the training data, and if there's secret sauce for automatically determining information quality that's obviously huge. Google was built in part on such insights. I suspect they do have something. I'd be utterly astonished if quality sorting were an emergent property of LLMs (especially given it's iffy in humans).
The problem, of course, is that if they do have a way of privileging data for training, that information is going to be the center of the usual arms race for attention and thinking. It can't be truly public or it's dead.
I think it is worth pointing out that plenty of prose editors advised on words to remove from your sentences (IA Writer as an example) before LLMs were a thing. If they used one of these tools making suggestions, would they no longer be worth reading? What about the green squiggle of grammar errors before this?
I'm not too worried about "default GPT style" becoming common, though, because I think it's more likely to be used by the people who have no style beyond "what I see on TV and in my family." Raising the floor, basically.
Anyone who wants their writing to stand out will still have to differentiate themselves. To put it another way: you're gonna be able to recognize the lazy users pretty quickly cause they're gonna have "GPT voice."
We know there's no deeper level to the simulation/game because we have the entire "game history" (the chat history) and we understand it in approximately same way that the LLM does. (That's what the LLM was trained to do, understand and respond to text the same way we do.) We know that the bot has no hidden state when it's not the bot's turn because of how the bot's API works.
So there's nowhere for a deeper simulation to live. It's as shallow as it looks.
More:
https://skybrian.substack.com/p/ai-chats-are-turn-based-game...
One trick I use is to assume it has a "Wikipedia-level" knowledge of pretty much any topic. Often that's what I need! I want to ask some quick questions of someone eloquent who's read the Wikipedia article about something, to save me from having to read the whole thing myself.
If I need more expertise than you can get from reading Wikipedia I know that ChatGPT alone is very unlikely to cut it.
Anyway, I've run dozens of epigrams through it, almost none of which are based on any known proverbs, and it did great on almost all of them.
Even if you weren't impressed by this one, I encourage you to check out some of its other explanations at the link above.
"John Smith"
Where is it from?
"New York City"
I haven't tried to turn up the temperature but I assume that's needed to convince it to give unexpected responses.
There is definitely an element of averaging going on in these models and it's worth maintaining awareness of it. This is IMO also the cause of ChatGPTs odd disembodied voice. It's always projecting some modes in the data.
The above responses here are not too surprising to me -- first, we drastically trade in the variance for bias on a limited dataset. We have a very particular tone here that we're looking for, so bias towards the less spurious modes likely would be a good default for an 'instruction-following' mode.
After all, that very strong bias is what lets us give it instructions and stay in a back-and-forth conversation instead of devolving into something like a movie/book dialog or the middle of a news article about asking about people's names.
I could be entirely wrong about these assumptions, however, I only have my best guesses and (potentially spurious) information to go off of.
We do have a tiny dataset for ALPACA -- only about 55k paired examples, which is great, but that's either going to be a tiny run or not too difficult to overfit to. I'm sure there's good mitigations for that.
I agree that more unexpected responses would be good but I think it's sorta a mathematical "cake and have it too" that a lot of people want. Maybe this isn't at the pareto front for the compute/data/etc (almost certainly not to be honest, it's all pretty new after all!), but your above examples do show the bias-variance tradeoff well, and it looks like we got scammed on the exchange rates.
Running the same example on the base model (i.e., LLAMA with no ALPACA) will get you a representative model of names and such from the open internet that rapidly approaches the underlying distribution as the model size approaches infinity (so far as we know). This is because this is, as effectively as possible (I believe, and please correct me if I'm wrong), an unbiased estimator of the underlying distribution. We're just approaching that minimum variance limit as we minimize the cross-entropy loss.
Necessarily the cross-entropy loss is not minimized under biased circumstances. So a raw language model on the full corpus of text would be brutally penalized* for simply sampling the main or most likely modes of the distribution, whereas for the sake of instruction following this may not be the case.
And of course this is just the surface argument about the statistics of the models, personally the actually interesting things to me are the concept factorizations that happen internal to the (raw next-word-generation) models under cross-entropy -- especially as both the model size and the amount of data grows. Then you start seeing those trends that would only occur when the model is able to disentangle the concept data from the raw statistics, which is I think rather straightforward. One can pull out their frequentist slide rule and mark this down to a T if they'd like to.
Now, this of course transforms wonderfully under the usecase of the fine-tuned instruction-following model usecase. While indeed hampered by bias (to include perhaps the strongest of all: "as an AI model, I cannot blah blah blah blah...."), we see these disentangled concepts take flight as the user asks for varied cross-domain and multi-modal usecases. To me, it's probably the best test to see what's survived the fine-tuning and what hasn't. And we can see that in many of these large language models with RLHF and the like -- indeed it has! It can be quite wrong, but lots of people seem to jump on the "it's just statistical hallucinations" without really asking _why_ it's doing what it's doing. If it was merely a word-chain hallucination, then I don't believe people would be fooled as easily. To me, it's just a standard limits-of-out-of-distribution inference problem that we see with practically every other (non-symbolic) network out there.
That's the cool thing. Neural networks usually don't do this well OOD and that is special.
One other point to the John Smith example and New York is that there's not really a 'good' response to that question other than a perfect distribution of names and places matching the real world. I'm not sure what I'd want to see there, and I guess it depends upon the data it was fine-tuned on. Maybe that is a good test, it is out of my realm of experience with that particular dataset, however.
Hopefully this clears up some of the subtleties between the base and RLHF models and the tradeoffs/etc etc within, at the very least from the perspective I'm coming at it from.
I do really want to emphasize that this is the reason I'm frustrated that people who are skilled in DL are allowing this messaging to happen without reinforcing the math behind it. There's a second, much much much more interesting discussion about the actual structural and developmental elements of these networks that is getting displaced by this discussion, which to me is a bit more surface level and leads itself in circles to the same conclusions we have arrived to for other neural networks, and we generally have little to show for it in the end. Like, we've really got to get a move on to the core of the development of the network during training and the intrinsics of how info theory is specifically impacting these models during training. That's the most interesting area to me and it's what I've learned the most from focusing on. Let's focus as a field on that instead, it's cool stuff and has much more of an effective impact in the long run. <3 :'''')
*(In a large-enough batchsize domain, which we do see with the enormous token batchsizes in LLMs)
For a number of years we've been basically showing the first to be the case, especially as the model is scaled and the context increases, differentially against the second. String density probabilities can be surprisingly brittle, to be honest. The curse of dimensionality applies to them too, believe it or not, which I believe is why topic discussion, reasoning, and integration over longer distances of text is that differential test that shows pretty clearly that substring memorization/text density stuff is not 'just' what the model is learning. Because mathematically/statistically/from an information density perspective/etc etc otherwise it would be basically impossible, I think.
That's my best understanding, at least.
-----
#import ~/work/gish/tasks/coding.txt
Change the following so that it looks for the open AI key in the following fashion:
1. env variable
2. os.home()/.openai
3. Throws an exception telling the user to put it in one of the above, and then exits
#diff ~/work/gish/src/LLM.ts
-----
Puts me in vimdiff comparing the old code with the generated code letting me review and cherry pick the changes.
I haven't seen anyone else describing this workflow. Feed it the existing code, ask it to modify/improve/fix the code and output a new version of all the input code, review diffs.
It has downsides, because you can easily run out of context window of chatgpt-3.5-turbo. But I am getting much better code out of it versus other approaches I've tried. And it's a very efficient and natural workflow -- we're used to getting and reviewing diffs/PRs from human collaborators.
It's actually $0.002/1k, FYI