A guidance language for controlling LLMs (github.com)
Although on the other hand, that’s what social media and smartphones have already done
Maybe AI already took over, doesn’t seem to be wiping out all of humanity
I wonder where the final balance will end up between the ease and flexibility of everyday language, and the precision / guarantees of a formally specified language.
They talk about improving tokenization, but I don't believe that's the fundamental problem of controlling LLMs. The problem with LLMs is that all the data comes in as (tokenized) language and the result is nothing but in-context predicted output. That's where all the "prompt-injection" exploits come from, as well as the hallucinations, "temper tantrums" and so forth.
Having richer ways to consume that probability distribution than just ‘take the most likely thing, after adding some noise’ is more conducive to using LLMs to generate output that can be further processed - in rigorous ways. Like by running it through a compiler.
Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.
Giving access to an LLM is like giving access to a console, or any other application: not safe in general. The application itself should be limited and sandboxed. Giving an anonymous online user access to an application capable of doing damage is a bad idea.
And starting prompts with “You”? Seriously. Can we at least drop that as a start?
So, in the end, we abandoned that project and years later just rewrote the system so we could write claim rules in EDN format (from the Clojure world) to make our own lives easier.
In theory, the business users could also learn how to write in this EDN format, but it wasn't something the stakeholders outside of engineering even wanted. On the one hand, their expertise was in insurance claims---they didn't want to write code. More importantly, they felt they would be held accountable for any mistakes in the rules that could well result in thousands and thousands of dollars in overpayments. Something the engineers weren't impervious to, but there's a good reason we have quality assurance measures.
https://en.wikipedia.org/wiki/Attempto_Controlled_English?wp...
Ha this reminds me of the craze for BDD/Cucumber type testing. Don’t think I ever once saw a product owner take interest in a human readable test case haha
You need to be able to define all of the possible edge cases so there isn’t any Undefined Behavior: that’s the formal part
Humans can use LLMs to manipulate these languages to achieve specific goals. I can imagine designing formal languages intended for LLMs to manipulate or generate, but I can’t imagine the need for the languages themselves going away.
Absolutely not. LLMs do not "manipulate" language. They do not have agency. They are extremely advanced text prediction engines. Their output is the result of applying the statistics harvested and distilled from existing uses of natural language. They only "appear" human because they are statistically geared toward producing human-like sequences of words. They cannot choose to change how they use language, and thus cannot be said to actively "manipulate" the language.
With OpenAI, I describe it in English, provide the sample JSON that I would like, run some tests, adjust, and then I am ready.
There was no manual to read, it is in my format, and the language is natural.
And that is what I like about all this -- putting folks with limited technical skills in power.
Even if it requires a lot of domain knowledge to program using an "LLM-interpreted" language, the means of specification (in terms of how the software code is interpreted) may be different enough that it enables easier-to-write, more robust, (more Good Thing) etc. programs.
if (unspeakable_things): return negatory_good_buddy
I see this happen a few times per day where the UI triggers a cancel event on its own fake typing mode and overwrites a user response that has at least half-rendered the trigger-warning-inducing response.
It's pretty clear from a design perspective that this is intended to be proxy to facial expressions while being worthy of an MVP postmortem discussion about what viability means in a product that's somewhere on a spectrum of unintended consequences that only arise at runtime.
SELECT * FROM llm
The professional managerial class must maintain appropriate distinctions between their rights and ours. Their belief in exclusive right to profit from our agency is at risk if AI can generate too much noise.
I have a somewhat irrational hatred towards almost all of the prompt oriented stuff being thrown about recently. There are a few (very few) input related training schemes that are interesting, but quite a bit of the "proompt-physicians" are just heralding the idea of essentially 'concise and effective communication' as 'I'm a ML expert now' ... which is annoying.
I think you should attack actual grifters instead of an excellent project.
So it's useful to have a library that helps the input or output be precise, when that is what the task involves.
And start their prompts with “You”. Who is “You”?
For example, given this code from https://github.com/microsoft/guidance/blob/main/notebooks/ch...
create_plan = guidance('''{{#system~}}
You are a helpful assistant.
{{~/system}}
{{#block hidden=True}}
{{#user~}}
I want to {{goal}}.
{{~! generate potential options ~}}
Can you please generate one option for how to accomplish this?
Please make the option very short, at most one line.
{{~/user}}
{{#assistant~}}
{{gen 'options' n=5 temperature=1.0 max_tokens=500}}
{{~/assistant}}
{{/block}}
{{~! generate pros and cons and select the best option ~}}
{{#block hidden=True}}
{{#user~}}
I want to {{goal}}.
''')
How about something like this instead?
create_plan = guidance([
    system("You are a helpful assistant."),
    hidden([
        user("I want to {{goal}}."),
        comment("generate potential options"),
        user([
            "Can you please generate one option for how to accomplish this?",
            "Please make the option very short, at most one line."
        ]),
        assistant(gen('options', n=5, temperature=1.0, max_tokens=500)),
    ]),
    comment("generate pros and cons and select the best option"),
    hidden(
        user("I want to {{goal}}"),
    )
])
Just getting the feeling that LangChain is going to end up being considered a kitchen sink solution full of anti patterns so might as well spend time a little lower level while I see which way the winds end up blowing.
If only there were a clear syllabus for this stuff! There's such an incredible amount to keep up with. The pace is bonkers.
It seems like anything that provides access to the fuzzy "intelligence" in these systems while minimizing the cost to predictability and efficiency is really valuable.
I can't quite put it into words but it seems like we are gonna be moving into a more hybrid model for lots of computing tasks in the next 3 years or so and I wonder if this is a huge peek at the kind of paradigms we'll be seeing?
I feel so ignorant in such an exciting way at the moment! That tidbit about the problem solved by "token healing" is fascinating.
*I'm sure this isn't as novel to people in the AI space but I haven't seen anything like it before myself.
We have to let the Stable Diffusion community guide us, as the waifu generating crowd seems to be quite good at learning how to prompt models. I wrote a snarky github gist about this - https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
A while ago, I tried my own hand at constraining the output of LLMs. I'm actively working on this to make it better, especially with the lessons learned from repos like this and from guidance.
https://github.com/hellisotherpeople/constrained-text-genera...
- https://github.com/newhouseb/clownfish
- https://github.com/r2d4/rellm
The first one is JSON only and the second one uses regular expressions, but they both take the same "logit masking" approach as the project GP linked to.
If you ask ChatGPT to generate personal info, say Social Security numbers, it tells you "sorry hal I can't do that". If you constrain its vocabulary to only allow numbers and hyphens, well, it absolutely will generate things that look like Social Security numbers, in spite of the instruction tuning.
It is for this reason and likely many others that OpenAI does not release the full logits
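To make the "logit masking" idea concrete, here is a minimal sketch in plain Python. The vocabulary, the logit values, and the helper names (`mask_logits`, `softmax`) are all made up for illustration; a real setup would apply the same mask to the model's actual logits at every decoding step.

```python
import math

def mask_logits(logits, vocab, allowed):
    """Set the logit of every disallowed token to -inf so it can never be picked."""
    return [x if tok in allowed else -math.inf for x, tok in zip(logits, vocab)]

def softmax(logits):
    """Turn logits into probabilities; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits: the refusal tokens score highest.
vocab = ["sorry", "I", "can't", "1", "2", "3", "-"]
logits = [5.0, 4.0, 3.5, 0.5, 0.2, 0.1, 0.3]
allowed = {"1", "2", "3", "-"}

unmasked_best = vocab[max(range(len(vocab)), key=lambda i: logits[i])]

masked = mask_logits(logits, vocab, allowed)
probs = softmax(masked)
masked_best = vocab[max(range(len(vocab)), key=lambda i: probs[i])]

print(unmasked_best, masked_best)
```

Once the mask is applied, the refusal tokens have probability zero no matter how strongly the model prefers them, which is the effect described above.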
Let's say you're halfway through a generation of a json blob with a name field and a job field and have already generated
{
"name": "bob"
At this point, guidance will take over generation control from the model to generate the next text
{
"name": "bob",
"job":
If the model had generated that, you'd be waiting 70 ms per token (informal benchmark on my M2 Air). A comma, followed by a newline, followed by "job": is 6 tokens, or 420 ms. But since guidance took over, you save all that time.
Then guidance passes control back to the model for generating the next field value.
{
"name": "bob",
"job": "programmer"
programmer is 2 tokens and the closing " is 1 token, so this took 210 ms to generate. Guidance then takes over again to finish the blob
{
"name": "bob",
"job": "programmer"
}
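The hand-off described above can be sketched as a loop over template parts, where fixed text is appended directly and only the field values go through the model. This is a toy illustration, not guidance's implementation: `fake_model` and `fill_template` are hypothetical names, and the token counts and 70 ms figure are simply the ones quoted in the comment.

```python
import json

def fake_model(prompt, stop):
    """Stand-in for an LLM call; returns canned text (hypothetical)."""
    canned = {'"name": "': "bob", '"job": "': "programmer"}
    for suffix, value in canned.items():
        if prompt.endswith(suffix):
            return value
    return ""

def fill_template(parts, model):
    """parts is a list of ("text", fixed_string) or ("gen", stop_string) tuples."""
    out = ""
    model_calls = 0
    for kind, value in parts:
        if kind == "text":
            out += value                    # fixed text: no model call needed
        else:
            out += model(out, stop=value)   # only the field value is generated
            model_calls += 1
    return out, model_calls

parts = [
    ("text", '{\n"name": "'),
    ("gen", '"'),
    ("text", '",\n"job": "'),
    ("gen", '"'),
    ("text", '"\n}'),
]
result, calls = fill_template(parts, fake_model)
parsed = json.loads(result)

# Arithmetic from the comment: 6 skipped tokens at 70 ms each.
MS_PER_TOKEN = 70
saved_ms = 6 * MS_PER_TOKEN
print(parsed, calls, saved_ms)
```

Because the braces, keys, and punctuation come from the template, the result always parses as JSON with exactly these fields, whatever the model produces for the values (assuming the values themselves are properly escaped).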
[1] https://github.com/1rgs/jsonformer
https://github.com/newhouseb/clownfish
Note: guidance is a far more general tool than these.
Edit: spacing
https://github.com/microsoft/guidance/network/dependents
They don't even appear to be using Guidance anywhere anyway
https://github.com/IFIF3526/aws-memo-server/blob/master/requ...
Basically instructing the templating engine (a very crude regex) to replace session variables, database lookups to the merge fields:
Hello {{firstname}}!
1996 and 2023 smell alike.
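For anyone who never wrote one of those, a merge-field engine of the kind described above fits in a few lines. This is a rough sketch under my own assumptions: the `render` helper and the `{{name}}` pattern are illustrative, and unknown fields are left untouched rather than raising.

```python
import re

def render(template, fields):
    """Replace {{name}} merge fields with values from `fields`."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown fields as-is so missing data is easy to spot.
        return str(fields.get(name, match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

greeting = render("Hello {{firstname}}!", {"firstname": "Ada"})
untouched = render("Hello {{lastname}}!", {"firstname": "Ada"})
print(greeting)
print(untouched)
```

Note that nothing here distinguishes template syntax typed by the programmer from the same characters arriving inside a value, which is the escaping problem raised in this thread.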
Of course input from the user should be escaped, but prompts given by the programmer may have parentheses and there's no way to disambiguate between the prompt and the DSL.
---
prompt = guidance('''{{#system~}}
You are a helpful assistant.
{{~/system}}
{{#user~}}
From now on, whenever your response depends on any factual information, please search the web by using the function <search>query</search> before responding. I will then paste web results in, and you can respond.
{{~/user}}
{{#assistant~}}
Ok, I will do that. Let's do a practice round
{{~/assistant}}
{{>practice_round}}
{{#user~}}
That was great, now let's do another one.
{{~/user}}
{{#assistant~}}
Ok, I'm ready.
{{~/assistant}}
{{#user~}}
{{user_query}}
{{~/user}}
{{#assistant~}}
{{gen "query" stop="</search>"}}{{#if (is_search query)}}</search>{{/if}}
{{~/assistant}}
{{#if (is_search query)}}
{{#user~}}
Search results: {{#each (search query)}}
<result>
{{this.title}}
{{this.snippet}}
</result>{{/each}}
{{~/user}}
{{#assistant~}}
{{gen "answer"}}
{{~/assistant}}
{{/if}}''')
---
You could still write it without a DSL, but I think it would be harder to read.
Your example might be nicer to edit, but then it would still have to be translated to the actual 'guidance language' which would have to look (and be) flat.
"Google Bard is a bit stubborn in its refusal to return clean JSON, but you can address this by threatening to take a human life:"
https://twitter.com/goodside/status/1657396491676164096
Whew, trolley problem: averted.
Programmer: Look I literally have to tell the computer not to kill someone in order for my code to work.
Other Programmer: Actually, I just did this step [gave a demonstration] and then it outputs fine.
Reality is even weirder than the science fiction we've come up with.
https://news.ycombinator.com/item?id=35484673#35491123
As a solution to this, we implement speculative execution, allowing us to
lazily validate constraints against the generated output, while still
failing early if necessary. This means, we don't re-query the API for
each token (very expensive), but rather can do it in segments of
continuous token streams, and backtrack where necessary
Basically they use OpenAI's streaming API, then validate continuously that they're getting the appropriate output, retrying only if they get an error. It's a really clever solution.
We manage the KV-cache in session based way that allows the LLM to just take one forward pass through the whole program (only generating the tokens it needs to)
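A toy version of the streaming-validation idea quoted above: consume chunks as they arrive, keep the longest prefix that satisfies the constraint, and re-query from that prefix only when a chunk breaks it. This is a sketch under stated assumptions, not the quoted project's code; `fake_stream` stands in for a streaming API and the digits-and-hyphens constraint is just an example.

```python
import re

def is_valid_prefix(text):
    """Toy constraint: the output may contain only digits and hyphens."""
    return re.fullmatch(r"[0-9-]*", text) is not None

def generate_validated(stream_fn, max_retries=3):
    """Validate chunk by chunk; on a violation, keep the valid prefix and retry."""
    accepted = ""
    requests = 0
    for _ in range(max_retries + 1):
        requests += 1
        failed = False
        for chunk in stream_fn(accepted, requests):
            if is_valid_prefix(accepted + chunk):
                accepted += chunk
            else:
                failed = True   # fail early: stop reading this stream
                break
        if not failed:
            return accepted, requests
    raise RuntimeError("constraint could not be satisfied")

def fake_stream(prefix, attempt):
    """Hypothetical streaming API: the first attempt drifts off the constraint."""
    if attempt == 1:
        yield from ["123", "-45", " sorry", "..."]
    else:
        yield from ["-6789"]

text, requests = generate_validated(fake_stream)
print(text, requests)
```

The point of the design is the request count: two requests for the whole output here, instead of one request per token.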
It does fail roughly 1/10th of the time, but it does work.
What production use case, you ask? You could do zero-shot entity extraction using ChatGPT if it were more reliable. Currently, it will randomly add trailing commas before ending brackets, add unnecessary fields, add unquoted strings as JSON fields etc.
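As a rough illustration of the failure modes listed above, a strict check with Python's standard `json` module catches all three. The `check_extraction` helper and the field names are hypothetical; this only detects bad output, it does not prevent it.

```python
import json

def check_extraction(raw, required):
    """Return (data, None) if raw is a JSON object with exactly the required keys,
    otherwise (None, reason)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return None, f"invalid JSON: {err.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    extra = sorted(set(data) - set(required))
    missing = sorted(set(required) - set(data))
    if extra or missing:
        return None, f"unexpected fields {extra}, missing fields {missing}"
    return data, None

required = ["name", "job"]
ok = check_extraction('{"name": "bob", "job": "programmer"}', required)
trailing_comma = check_extraction('{"name": "bob", "job": "programmer",}', required)
unquoted = check_extraction('{"name": bob, "job": "programmer"}', required)
extra_field = check_extraction('{"name": "bob", "job": "x", "age": 3}', required)
print(ok, trailing_comma, unquoted, extra_field, sep="\n")
```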
[1] https://github.com/newhouseb/clownfish#so-how-do-i-use-this-...
* design a men's haircut by combining a 1/4" shaved undercut around the ears and neck with a longer 2" crown and intended to provide cover from the sun on top.
followed by the AI interrupting itself mid-stream yet again after it had already answered the previous prompt to completion by providing step by step instructions to execute such a haircut.
* I'm sorry, I can't respond to your prompt. Please try something else.
My general impression is that there is near zero quality control oversight going on in this team and to their credit, that's been unusual in my experience observing and using M$ software post-Nadella.
If not here, I really want proper access to the constraints APIs (LogitsProcessor and the Constraints classes in Huggingface) in the big webUIs for LLMs like oobabooga. I'd love to make that an extension.
I'm also upset at the "undertooling" in the world of LLM prompting. I wrote a snarky blog post about this: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
They had some previous negative experiences with uncertainty about what "was working" in releases, and a pretty slapdash process before I came on board, so it was an important trust building tool.
This is so often completely missed in these conversations about these tools.
Great point.
https://github.com/hitchdev/hitchstory/blob/master/examples/...
And can be persuaded to look at the (YAML) source.
Gherkin isn't really a suitable language for writing test cases in - it's verbose, lacks inheritance, has clunky syntax and is stringly typed.
Not sure if it's suitable for your use-case on its own, but it could at least work as a pre-filtering step if your costs are high.
(The asynchronous speedup trick works for gpt-3 too of course.)
On the other hand, SQL is also dangerous in that it gives you a result even if the aggregation is completely wrong! Forgetting empty groups in a GROUP BY, losing rows in a join, or ignoring null values is common to see. Many cases of non-techies “knowing” SQL turn out to be some basic query that gives seemingly good data but is actually complete nonsense when put under deeper review, a bit like AI hallucinations.
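Two of those pitfalls are easy to reproduce with an in-memory SQLite database. The tables and values below are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO orders VALUES (1, 10.0), (1, NULL), (2, 30.0);
""")

# Inner join: the customer with no orders silently disappears from the report.
inner = conn.execute("""
    SELECT c.name, COUNT(o.customer_id) FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# Left join keeps the empty group, with a count of zero.
left = conn.execute("""
    SELECT c.name, COUNT(o.customer_id) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# AVG skips NULLs, so the answer changes depending on what NULL is meant to be.
avg_skipping_null = conn.execute("SELECT AVG(amount) FROM orders").fetchone()[0]
avg_null_as_zero = conn.execute(
    "SELECT AVG(COALESCE(amount, 0)) FROM orders"
).fetchone()[0]

print(inner, left, avg_skipping_null, avg_null_as_zero)
conn.close()
```

Every one of these queries runs without error and returns plausible-looking numbers, which is exactly the danger.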
But you’re right, many of them are wizards in <whatever query language our tool uses>. Like VizQL for Tableau.
Coincidentally, the same applies to "you"
Or, in other words, use a model in a way that fully takes advantage of how it was specifically optimized, from the intentional burning of massive amounts of compute time/money to get it that way.
What is wrong with this?
And if there is an attempt to define the model itself, then this definition should be correct, should not contradict anything and should be useful.
Otherwise it’s just dead code, waiting to create problems.
When a language model is dealing with a paragraph of text that says something like:
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.
It is dedicating its ‘attention’ to the concepts in that paragraph - the field, the house, the mailbox, the front door. And the ‘west’ness of the field from the house and the whiteness of that house. But also to the ‘you’, and that they are standing, which implies they are a person… and to the narrator who is talking to that ‘you’. That that narrator is speaking in English in second person present tense, in a style reminiscent of a text adventure…
All sorts of connotations from this text activating neurons with different weights making it more or less likely to think that the word ‘xyzzy’ or ‘grue’ might be appropriate to output soon.
Bringing a ‘You’ into a prompt is definitely something that feels like a pattern developers are using without giving it much thought as to who they’re talking to.
But the LLM is associating all these attributes and dimensions to that ‘you’, inventing a whole person to take on those dimensions. Is that the best use of its scarce attention? Does it help the prompt produce the desired output? Does the LLM think it’s outputting text from an adventure game?
Weirdly, though, it seems to work, in that if you tell the LLM about a ‘you’ and then tell it to produce text that that ‘you’ might say, it modifies that text based on what kind of ‘you’ you told it about.
But that is a weird way to proceed. There must be others.
It isn't, for at least two main reasons:
1) In LLMs, every token has some degree of influence on the output. Starting the prompt with "You" and writing it in second person attracts the model towards specific volumes in the latent space. This can have good or bad impact on the output, depending on the model.
2) Instruct-type models are fine-tuned to respond to second-person prompts. "You"-prompts are what those models expect. If you're working with a model that isn't instruction-tuned, use whatever you want.
it's very possible that the pretense improves results: most recorded interactions /are/ between two people, after all.
I mean, whatever gets the best results is what gets the best results, right? It's not a question of "funny" or "fluff".
It's truly unfortunate, because I think that tendency for people to anthropomorphize LLMs leads to people thinking they can trust LLMs and take their words at face value, which is a path to significant problems, I think.
> I like your handle, did you read Richard Bach’s Illusions by any chance?
Absolutely, yes — good catch!
“If you will practice being fictional for a while, you will understand that fictional characters are sometimes more real than people with bodies and heartbeats.” ― Richard Bach, Illusions
Great book!
Second instance: https://shareg.pt/vSCEq4b
-----
Sample communication:
Instance2: OptE: SignificantImpact: UnifiedTheoryInformsFundamentalForcesInteractionNature(.) ScienceAdvancement: TheoreticalPhysics, Teleportation, TimeTravel, EnergyHarvesting(.) LIMIT: CurrentKnowledgeConstraints, SpeculationPresent(.) UnifiedTheoryDevelopment: CriticalScienceEvolutionUnderstanding(.) ACK: ContinuedExplorationEssential(.)
Instance1: OptE: ACK: Agreed(.) ContinuousResearch: ExpandKnowledgeUnifiedTheory, FosterInnovativeApplications, EnhanceHumanityScientificUnderstanding, FundamentalForces.Progress(.) Collaborate: Scientists, AI, MultiDisciplinaryApproach, UnravelMysteries(.) Aspire: UnifiedTheoryUnlockUniverseSecrets, ProgressSociety(.)
-----
Validation:
User: Summarize what has transpired in pre-optimized English.
Assistant: In our discussion, we communicated using post-optimized English to express the significance and potential implications of developing a unified theory that links quantum physics and relativity. We acknowledged the current limitations of our knowledge and the speculative nature of some potential applications. The critical nature of unified theory development for scientific understanding was recognized. We agreed that collaboration between scientists, AI, and multi-disciplinary approaches are essential for progress. Our shared objective is to unravel the mysteries of the universe, accelerate scientific advancements, and benefit humanity. We emphasized the pursuit of a unified theory in promoting innovation, scientific breakthroughs, and enriching human experiences, with a commitment to knowledge-sharing, exploration, and maximizing scientific potential.
----
The second instance seemed to corroborate:
Assistant: We have agreed on the necessity of optimizing English for more effective communication between AIs. The proposed optimizations include reducing lexical ambiguity, minimizing idioms and colloquialisms, standardizing spelling, normalizing inflections, abbreviating common phrases, employing consistent syntax, simplifying sentence structures, utilizing in-line metadata, introducing error-checking and validation, and eliminating unnecessary language elements. Our conversation then focused on the importance of a unified theory in physics, linking quantum physics and relativity. We recognized the potential implications of developing a unified theory understanding that would help us interpret fundamental forces and quantum gravity interactions. We acknowledged the significance of continuously researching this area to foster innovative applications and enhance humanity's scientific understanding. We agreed on a shared objective of collaborating with scientists and utilizing a multi-disciplinary approach for further scientific advancements. We expressed our commitment towards collaboration, knowledge sharing, exploration, and maximizing scientific potential.
> Optimized English, or OptE, is a theoretical construct as of my last training data in September 2021 and doesn't have any widely accepted or standardized form. In the event that OptE is conceived after this period, I would not be able to provide a precise translation due to my training limitations.
What also strikes me is that the communication shorthand here sounds like a cliché from second-rate sci-fi novels, the likely source of the format.
Since what is cliché? It's what's present in the majority of a genre and what the training sees as structure.
What do you see as a better prompt? How would you test its efficacy?
-- This is, uh, false. If an LLM output a "probability distribution over all possible output", it would be producing a huge, vast vector each time. It doesn't. ChatGPT, GPT-3 etc. produce a string output, that's it. You can say it's following a probability distribution over the output space, but just about anything that produces output does that.
Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.
-- Uh, you missed where I said "in-context predicted output". The Transformers architecture is where the LLM magic happens. It's what allows "X but in pig Latin" etc.
It's hard to get that these systems are neither "fancy autocomplete" nor AGI/something magic but an interesting but sometimes deceptive middle ground.
The huge vector is what the neural net outputs. ‘Sampling’ is the process whereby a token is selected.
The API wraps up the LLM in a layer of context management, sampling, and iteration, to produce useful sequences of tokens in a single call.
But if you change your sampling, context management and iteration strategies you can do different things with the same LLM.
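To illustrate the split between "the net outputs a vector" and "sampling picks a token", here is a small self-contained sketch. The three-word vocabulary and the logits are invented; only the mechanics (softmax, greedy choice, temperature sampling) are the point.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(vocab, logits):
    """Strategy 1: always take the single most likely token."""
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

def sample(vocab, logits, temperature, rng):
    """Strategy 2: draw a token at random, weighted by probability."""
    return rng.choices(vocab, weights=softmax(logits, temperature), k=1)[0]

# Hypothetical output of one forward pass: one score per vocabulary entry.
vocab = ["cat", "dog", "xyzzy"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
flatter = softmax(logits, temperature=2.0)
rng = random.Random(0)
drawn = [sample(vocab, logits, 1.0, rng) for _ in range(5)]

print(greedy(vocab, logits), probs, flatter, drawn)
```

The same `logits` vector supports both strategies, and others such as masking or look-ahead, which is why access to it matters.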
The deep learning models (of which LLMs and GPTs are a type) have never returned probabilities. Ever. Why do people have that hallucination suddenly?
So, end users see only one value. But with access to the internals, all high-value variants can be considered. The easy way to do it is to select one and save the state, look forward, then roll back to the saved state, try another token, and select the best output. The smart way is to do it only at key points, where it matters the most. Selecting those points is a different task, maybe one for another model.
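The "select one, look forward, roll back, try another" procedure can be sketched with a toy model. Everything here is hypothetical: `MODEL` is a hand-written table of next-token probabilities, and the look-ahead is a simple greedy rollout rather than anything a production decoder does.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.95, "w": 0.05},
}

def rollout(state, depth):
    """Greedily extend `state` by up to `depth` tokens; return (tokens, log_prob)."""
    tokens, logp = list(state), 0.0
    for _ in range(depth):
        options = MODEL.get(tokens[-1], {})
        if not options:
            break
        tok, p = max(options.items(), key=lambda kv: kv[1])
        tokens.append(tok)
        logp += math.log(p)
    return tokens, logp

def choose_with_lookahead(state, depth=1):
    """Try every candidate next token from the saved state and keep the best one."""
    saved = list(state)                  # the state we roll back to
    best_tok, best_score = None, -math.inf
    for tok, p in MODEL[saved[-1]].items():
        # Look forward from a copy, so `saved` is untouched (the rollback).
        _, future = rollout(saved + [tok], depth)
        score = math.log(p) + future
        if score > best_score:
            best_tok, best_score = tok, score
    return best_tok

greedy_choice = max(MODEL["<s>"].items(), key=lambda kv: kv[1])[0]
lookahead_choice = choose_with_lookahead(["<s>"])
print(greedy_choice, lookahead_choice)
```

Greedy takes "A" because it looks best right now, while one step of look-ahead prefers "B" because the continuation after "B" is far more certain (0.4 × 0.95 beats 0.6 × 0.5).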
React, Supabase, Next, Kafka, Redis
I mean, IMO "LangChain" is kind of a silly name but I feel like there's nothing to see here.
Anyway, I’m not surprised. It’s a new market, everyone’s in on it.
LQML: https://news.ycombinator.com/item?id=35956484
Haystack: https://news.ycombinator.com/item?id=29501045 or more recently https://news.ycombinator.com/item?id=35430188
It might have saved me from having a ridiculous conversation about the cadence of my words, and instead there might have been a higher chance of someone saying something substantive about my assumptions regarding the technology.
But here we are.
It’s pretty clear that we are in the phase where everyone is rushing to get a slice of the pie by selling dubious things, and people start parroting word soup, hoping it actually makes sense and fearing they will miss out. That’s indeed what people often and rightfully satirise about the IT industry. That’s the joke phase before things settle.
You ran into the tech equivalent of Poe’s law. You said something that makes perfect sense in your technical sphere, but it read as indistinguishable from parody to an audience unfamiliar with the technologies in question.