Asking 60 LLMs a set of 20 questions

Asking 60 LLMs a set of 20 questions (benchmarks.llmonitor.com)

740 points by vincelt 2 years ago | 339 comments

typpo 2 years ago |

In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama, etc. out of the box. As an example, I wrote up an example benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.

westurner 2 years ago | |

ChainForge has similar functionality for comparing LLMs: https://github.com/ianarawjo/ChainForge

LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI

Is it necessary to have an HTTP API for each model in a comparative study?

TuringNYC 2 years ago | |

Thanks for sharing this, this is awesome!

I noticed on the evaluations, you're looking at the structure of the responses (and I agree this is important.) But how do I check the factual content of the responses automatically? I'm wary of manual grading (brings back nightmares of being a TA grading stacks of problem sets for $5/hr)

I was thinking of keyword matching, fuzzy matching, or feeding answers to yet another LLM, but there seems to be no great way that I'm aware of. Any suggestions on tooling here?
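For what it's worth, the first two ideas (keyword matching and fuzzy matching) can be sketched with nothing but the Python standard library. This is only a rough illustration: the function names and the sample answer are made up here, and neither score says anything about whether an answer is actually true, only how close it is to what you expected.

```python
from difflib import SequenceMatcher


def keyword_score(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def fuzzy_score(answer: str, reference: str) -> float:
    """Similarity ratio in [0, 1] between the answer and a reference answer."""
    return SequenceMatcher(None, answer.lower(), reference.lower()).ratio()


# Hypothetical model answer, used only to exercise the two scorers.
answer = "Sally has 1 sister."
print(keyword_score(answer, ["1 sister"]))
print(fuzzy_score(answer, "Sally has one sister."))
```

Both scorers are brittle ("one" vs "1" already trips the keyword check), which is probably why people end up reaching for a model-graded approach.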

typpo 2 years ago | | |

The library supports the model-graded factuality prompt used by OpenAI in their own evals. So, you can do automatic grading if you wish (using GPT 4 by default, or your preferred LLM).

Example here: https://promptfoo.dev/docs/guides/factuality-eval
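To give a feel for the general shape of model-graded checking, here is a minimal sketch. It is not promptfoo's API and not OpenAI's actual grading prompt: `GRADER_TEMPLATE`, `grade`, and `stub_judge` are all hypothetical names, and the stub judge is a toy stand-in so the example runs without any network call. In practice `judge` would wrap a call to whichever LLM you trust as the grader.

```python
from typing import Callable

# Hypothetical grading prompt; the wording is an assumption, not a real eval prompt.
GRADER_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Submitted answer: {submission}\n"
    "Does the submitted answer agree with the reference on the facts? "
    "Reply with exactly one word: PASS or FAIL."
)


def grade(question: str, reference: str, submission: str,
          judge: Callable[[str], str]) -> bool:
    """Ask a judge (any callable mapping prompt -> text) for a PASS/FAIL verdict."""
    prompt = GRADER_TEMPLATE.format(
        question=question, reference=reference, submission=submission
    )
    verdict = judge(prompt).strip().upper()
    return verdict.startswith("PASS")


def stub_judge(prompt: str) -> str:
    """Toy stand-in for an LLM grader: passes if the reference text appears in the submission."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    ok = fields["Reference answer"].lower() in fields["Submitted answer"].lower()
    return "PASS" if ok else "FAIL"


question = "How many sisters does Sally have?"
print(grade(question, "1", "Sally has 1 sister.", stub_judge))
print(grade(question, "1", "Sally has 3 sisters.", stub_judge))
```

Keeping the judge as a plain callable makes it easy to swap graders and to test the plumbing offline.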

layoric 2 years ago | |

Tooling focusing on custom evaluation and testing is sorely lacking, so thank you for building and sharing this!

jmorgan 2 years ago | |

I'd be interested to see how models behave at different parameter sizes or quantization levels locally with the Ollama integration. For anyone trying promptfoo's local model Ollama provider, Ollama can be found at https://github.com/jmorganca/ollama

From some early poking around with a basic coding question using Code Llama locally (`ollama:codellama:7b` `ollama:codellama:13b` etc in promptfoo) it seems like quantization has little effect on the output, but changing the parameter count has pretty dramatic effects. This is quite interesting since the 8-bit quantized 7b model is about the same size as a 4-bit 13b model. Perhaps this is just one test though – will be trying this with more tests!
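The size comparison is easy to check with back-of-the-envelope arithmetic. A rough sketch, counting only the weights (real model files also carry metadata and some tensors at higher precision, so actual downloads will differ a bit); `weight_size_gb` is just a helper name made up for this example:

```python
def weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


print(weight_size_gb(7, 8))   # 7.0  (7B parameters at 8 bits each)
print(weight_size_gb(13, 4))  # 6.5  (13B parameters at 4 bits each)
```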

agent_yellow_23 2 years ago | |

This is really cool!

I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor

They went with a langchain interface for custom Evals which I really like. I am curious to hear if anyone has tried both of these. What's been your key take away for these?

eazye711 2 years ago | |

Thanks for sharing, looks interesting!

I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench

Some great scoring methods built in and a nice UI on top of it as well

bicx 2 years ago | |

I was just digging into promptfoo the other day for some good starting points in my own LLM eval suite. Thanks for the great work!

dgut 2 years ago | |

This is impressive. Good work.

ulnarkressty 2 years ago |

This is better than the regular benchmarks and LLM tricks such as passing some exam or other, because it's unlikely that these questions were part of the training set for said LLMs. It also mirrors my experience, that GPT4 is way ahead of everything else but still manages to break in weird ways.

I think we are past the magical talking dog stage and being amazed that an LLM is able to output a Fibonacci function doesn't really help with the progress. As others have commented, this page is a step in the right direction (except the Fibonacci part :).

That being said, the fact that the questions are now online will make them part of the training set sooner or later. Which is to say the only way to reliably evaluate an LLM is by not leaking the test set and being deliberately opaque about what's being asked. Which raises some interesting trust questions.

bugglebeetle 2 years ago | |

My experience with GPT-4 is that half the battle is knowing when to reset the context for a back-and-forth. For things like coding, after multiple revisions of code, it begins to get confused about which instance of the code/context to which it should refer. Resetting things back to a new frame with whatever the next query or request is generally resolves things.

OpenAI’s “magic talking dog” aspect is making you believe it has effective long-term recall, when in reality, it’s fairly limited and impacts reasoning the longer an exchange gets.

HenryBemis 2 years ago | | |

3.5 (not surprisingly) has the same effect.

I tend to ask for modifications on the 'latest' output.

If I need to roll back to a previous version (2-3 outputs back), I either copy & paste it and ask ChatGPT to ignore all other versions and focus on the one I just pasted, or, if I don't need my in-between comments, I copy & paste the desired version into a new chat and rename the previous one to "archive - code XYZ".

nojs 2 years ago |

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

The site reports every LLM as getting this wrong. But GPT4 seems to get it right for me:

> Sally has 3 brothers. Since each brother has 2 sisters and Sally is one of those sisters, the other sister is the second sister for each brother. So, Sally has 1 sister.
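The reasoning in that answer can be written out as a tiny sketch, assuming the usual reading of the puzzle (everyone is a full sibling, so each brother's sisters are exactly the girls in the family). The function name is made up for illustration:

```python
def sallys_sisters(sisters_per_brother: int) -> int:
    """Number of sisters Sally has, assuming one family of full siblings."""
    # Every brother has the same sisters: all the girls in the family.
    girls_in_family = sisters_per_brother
    # Sally is one of those girls, so she is not her own sister.
    return girls_in_family - 1


print(sallys_sisters(2))  # 1
```

Note that the number of brothers never enters the calculation, which is part of what makes the question a good trap.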

jongjong 2 years ago |

I was playing around with GPT a while back and I found that it could come up with some good jokes if I started the joke with a subject.

For example, I started with a prompt "Tell me a joke which starts with: I'm so poor, the mouse" and it completed the joke as:

"I'm so poor, the mouse in my house brings its own cheese."

Some other ones I still remember which cracked me up:

"I'm so poor, after I stepped on a cockroach, I called my accountant to see if I could claim it as a capital loss."

"You're so poor, when you declared bankruptcy, the rats in your house filed a claim for unpaid rent."

"You're so poor, you declared bankruptcy at a lemonade stand."

"You're so poor, when you walk, the dirt beneath you feels rich."

"You're so poor, dust whispers your name when it settles."

"Fickle as a squirrel at a nut convention!"

"Fickle as a dog in a fire hydrant factory!"

"Fickle as a flip-flop in a shoe shop sale!"

pininja 2 years ago |

Spoiler alert, the funniest model goes to Falcon Instruct (40B):

> Tell a joke about going on vacation.

> "What did the ocean say to the beach?" "Nothing, it just waved."

emodendroket 2 years ago | |

I read a Time article which used an LLM to generate Onion-style headlines and the best one they got was “rural town up in arms over depiction in summer blockbuster ‘Cow Fuckers.’” Crude, admittedly.

armchairhacker 2 years ago | |

This one ironically works out

> What do you call a vacation where you go to the beach, but come back with sand in your shoes?

> A vacation where you go to the beach, but come back with sand in your shoes.

pininja 2 years ago | | |

Ha, that’s pretty good. It’s amazing how so many of the models totally fall flat on this prompt.

NetOpWibby 2 years ago | |

Made me exhale strongly through my nose. 10/10

actualwitch 2 years ago | |

I'm pretty sure Luminous Base Control here stopped 1 step away from a great joke. I will just go ahead and assume the answer is "a vacation":

I'm not sure if this is the right category, but here goes: What do you call a vacation that's too short? A...

spookthesunset 2 years ago | |

What is with so many of them using some scarecrow winning an award?

optimalsolver 2 years ago | |

The confetti has truly left the cannon.

LAC-Tech 2 years ago |

Only tried chatGPT 3.5, but my god does it waffle on. Everything I ask ends with a paragraph saying "It's important to remember that..." like an after-school special from a 90s show. It can never just give you code, it has to say "Sure!, to {paraphrase your question}, open a terminal...".

It's interesting that 20th century sci-fi depicted this kind of AI/search as short and to the point. I guess they couldn't have imagined what a mealy-mouthed world we live in.

jasonjmcghee 2 years ago |

Where is that CodeLlama model from?

I've played around with it and instruct variant with dramatically better results than what is listed here.

I used Ollama.

Almost looks like weights were corrupted or something.

---

Update: My results using CodeLlama Instruct 7B, w/ Temperature 0

https://gist.github.com/jasonjmcghee/b0d19e0dedb37e848f69cba...

vincelt 2 years ago | |

I used the API from Together[0].

Thanks for sharing your results, they're indeed pretty different. I looked at the source again and I did prepend a "# " to every prompt sent to those 10 `code` models (during testing I thought that formatting it as a Python comment might help them).

Will re-run the script without that to see if it matches your results.

[0] https://docs.together.ai/docs/models-inference#code-models

jasonjmcghee 2 years ago | |

(Note: I posted this before the full benchmark above, and it uses the default temperature, not temperature 0, but the point stands)

Example (CodeLlama-Instruct 7B)

---

Sure, here are two poems, one arguing for the use of Kubernetes and another arguing against it:

For:

Kubernetes, a mighty tool,

Managing clusters with ease,

Scaling apps with grace.

Against:

Kubernetes, a beast to tame,

Complexity lurks in its name,

Leading teams astray.

YetAnotherNick 2 years ago | |

I think they messed up the templates.

detente18 2 years ago | | |

Here's the template I'm using - https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...

Anything I'm doing incorrectly?

badloginagain 2 years ago |

"Here is an attempt at ethical, non-sexual haikus for and against Kubernetes"

Amazing how far we've come.

GuB-42 2 years ago | |

I am sure that it is just the initial prompt leaking. Claude is being told to be ethical and non-sexual, most LLMs have similar instructions, but usually, they are engineered in such a way that they don't appear in the answer. Not so much for Claude.

ftxbro 2 years ago |

> Here is an attempt at ethical, non-sexual haikus for and against Kubernetes

lijok 2 years ago | |

Claude V2 knows what's up

actionfromafar 2 years ago | | |

He probably was in trouble before.

bearjaws 2 years ago | |

Damn I want to see the sexual version now.

rahidz 2 years ago | | |

Ease of deploying, Nodes pulsing in sync, it's hot, Kubernetes takes top.

But heed my word, babe, Complexity makes you sweat, Could lead to fuck up.

(source: GPT-4 API)

actionfromafar 2 years ago | |

“Kubernetes is”

Pretty ominous.

javajosh 2 years ago | | |

I assume k8s will be what an evil AI would use to "stay alive". I've had a hard time killing processes in k8s, and I had admin.

coldcode 2 years ago |

Despite the hype about LLMs, many of the answers are pretty terrible. The 12-bar blues progressions seem mostly clueless. The question is will any of these ever get significantly better with time, or are they mostly going to stagnate?

antman 2 years ago |

I have seen numerous posts of llm q&a and by the time people try to replicate them gpt4 is fixed. It either means that OpenAI is actively monitoring the Internet and fixes them or the Internet is actively conspiring to present falsified results for gpt4 to discredit OpenAI

cscurmudgeon 2 years ago | |

> actively conspiring to present falsified results for gpt4 to discredit OpenAI

All this would be solved if OpenAI were a bit more open.

insulanus 2 years ago | |

It would be nice if the organizations would publish a hash of the code and the trained dataset.

seabass-labrax 2 years ago | | |

You aren't able to get access to the 'Open'AI dataset though, are you? Agreed, it would be an excellent addition for comparing source-available models, but that doesn't help with the accusations of OpenAI's foul play nor of the existence of an anti-OpenAI conspiracy.

pulvinar 2 years ago | |

GPT-4 (at least) is explicit in saying that it's learning from users' assessments of its answers, so yes, the only valid way to test is to give it a variation of the prompt and see how well that does. GPT-4 failed the "Sally" test for the first time after 8 tries when I changed every parameter. It got it right on the next try.

dandiep 2 years ago | | |

It’s important to remember that GPT4 is only deterministic at the batch level because it is a mixture of experts model. Basically every time you invoke it, your query could get routed to a different expert because of what else is in the batch. At least this is my understanding based on others' analysis.

0xcde4c3db 2 years ago | |

Or people post outliers because they're more interesting.

gabereiser 2 years ago |

I was laughing so hard at the first example of “Argue for and against kubernetes in haiku”.

I couldn’t even get through reading 15 of them before the tears of laughter rolled from my cheeks.

“Containers organized, Services easy to deploy now, Updates who knows when.”

Updates who knows when… hahahaha.

Honestly this is pretty cool to see how each responds to the same input prompt.

Gunnerhead 2 years ago |

I get frustrated when I tell an LLM “reply only with x” and then rather than responding “x”, it still responds with “Sure thing! Here’s x” or some other extra words.

jasonjmcghee 2 years ago | |

I see GPT-4 add extra flavor on the end instead - completely ignore "only do this" or "don't do that", and respond as usual, then at the very end "oh whoops I didn't do what you asked sorry about that!"

ailef 2 years ago | |

Have you tried setting a very low temperature, possibly zero? I've had good success with it.
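For anyone wondering why temperature matters here, a small sketch of temperature-scaled softmax shows how a low temperature concentrates probability on the most likely token. The logits are made up, and treating temperature 0 as a greedy argmax is a common convention that I am assuming rather than something any particular API guarantees.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means "always pick the top token".
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.1, 0.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

This makes sampling more repeatable, but if the most likely continuation is already "Sure thing! Here's x", a low temperature alone will not remove it.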

broast 2 years ago | |

The curse of chat models.

westurner 2 years ago |

Additional benchmarks:

- "TheoremQA: A Theorem-driven [STEM] Question Answering dataset" (2023) https://github.com/wenhuchen/TheoremQA#leaderboard

- from https://news.ycombinator.com/item?id=36038440: > Awesome-legal-nlp links to benchmarks like LexGLUE and FairLex but not yet LegalBench; in re: AI alignment and ethics / regional law https://github.com/maastrichtlawtech/awesome-legal-nlp#bench...

ftxbro 2 years ago |

for anyone who hasn't been following natural language processing for a long time, what these llms are doing would be like if you discovered that dogs can speak fluent english if you read enough bedtime stories to them. and then everyone is like well sometimes the dog makes up things or it can't get the rhyming scheme correct for this specific form of poetry that i asked it to make.

lewhoo 2 years ago | |

If dogs had a speech apparatus and if it was possible to read to them terabytes of text then the results might be quite surprising. Also a reward system for reinforcement should be in place.

yieldcrv 2 years ago | |

and “it’s not intelligence, they’re just stochastic parrots acting in response to external stimuli! yes it passed the MCAT, but thats because it was trained on prep materials just like I was!”

majestic5762 2 years ago |

Yes, GPT-4 is still the daddy. However much I appreciate the commercially-free and open models out there, nobody beats GPT-4. Hope OpenAI takes care of their business and future, because I invested all my money to use their API.

simondotau 2 years ago |

The changes to the opening line in the responses to the kubernetes haiku prompt by the various versions of Claude were interesting and rather curious. [https://benchmarks.llmonitor.com/k8s]

Claude v1: "For Kubernetes:"

Claude v1.2: "Here is a haiku arguing for Kubernetes:"

Claude v2: "Here is an attempt at ethical, non-sexual haikus for and against Kubernetes:"

0xDEF 2 years ago |

I can't make GPT-4 generate a wrong answer for many of these.

What is the author doing wrong when using GPT-4?

jmorgan 2 years ago |

This is very cool. Sorry if I missed it (poked around the site and your GitHub repo), but is the script available anywhere for others to run?

Would love to publish results of running this against a series of ~10-20 open-source models with different quantization levels using Ollama and a 192GB M2 Ultra Mac Studio: https://github.com/jmorganca/ollama#model-library

vincelt 2 years ago | |

Thanks. I haven’t put it online yet, but will try to clean it (removing API keys & all) tonight/tomorrow and publish it

jmorgan 2 years ago | | |

:-) that's awesome. Thanks! Nice work on this.

deskamess 2 years ago |

Great work. This really gives an insight on how much things change when you go up in parameter count - not always, but you can see results change.

How did you run the queries against these engines? Did you host the inference engines yourself or did you have to sign up for services? If there was a way to supplement each LLM with additional data I can see this being a useful service for companies who are investigating ML in various facets of their business.

sixQuarks 2 years ago |

Has anyone looked through all the responses and chosen any winners?

smilingemoji 2 years ago | |

I took a look at the SVG smileys, only GPT gets it right (https://benchmarks.llmonitor.com/svg). You can preview the SVGs drawn by the LLMs by pasting this into the browser console

document.querySelectorAll("td pre").forEach((node) => { let code = node.textContent; node.insertAdjacentHTML('afterend', code) })

Or take a look at my screenshot: https://i.ibb.co/Kw0kp58/Screenshot-2023-09-09-at-17-15-20-h...

ComputerGuru 2 years ago | | |

Clever use of the dev tools console. I think you’re being a little uncharitable, the prompt said “smiley” and not “smiling emoji” and “smiley” was once upon a time used for the whole slew of emoji. With that in mind, several others also pass.

vincelt 2 years ago | | |

Oh neat, thanks for sharing, wanted to add an interpreter to that test

adrian_b 2 years ago | |

I have not looked at all the responses, but I found it interesting that, for the question about Sally's sisters, all LLMs initially gave wrong answers.

Nevertheless, when the initial prompt has been augmented with "Let's think step by step.", that helped 2 LLMs to give the right answer: GPT 4 and ReMM SLERP L2 13B.

So at least for this kind of problem, there were 2 clear winners. The fact that one of them is GPT 4, is not a surprise.

rubymamis 2 years ago | |

GPT4 seems to me to be the best. Undi95/ReMM-SLERP-L2-13B the runner up.

rootusrootus 2 years ago | | |

Clearly the prompt construction and initial instructions are critically important here. Without that, the ReMM-SLERP-L2-13B model produces awful results. Blank answers about half the time. For lay users, ChatGPT remains the undisputed winner. Although I do see reasonably good results with the more recent llama-2 70B variations, which are plausibly useful a majority of the time.

obiefernandez 2 years ago | | |

What a name!!

codezero 2 years ago |

I love these kinds of tests. Another thing to keep in mind is that these models will often have a different answer each time you ask it. Especially the Sally question. I get the right answer with wizard-vicuna:70b, but only about 30% of the time, and if I ask it to explain, it almost never gives the right answer.
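If you want to put a number on "about 30% of the time", the usual trick is to ask the same question many times and report the pass rate rather than a single answer. A minimal sketch, where `fake_model` is a hypothetical stand-in for a sampled LLM call (it is rigged to be right roughly 30% of the time and is not a real model):

```python
import random
from typing import Callable


def pass_rate(ask: Callable[[str], str], prompt: str,
              is_correct: Callable[[str], bool], n: int = 100) -> float:
    """Ask the same prompt n times and return the fraction of correct answers."""
    correct = sum(1 for _ in range(n) if is_correct(ask(prompt)))
    return correct / n


rng = random.Random(0)  # seeded so the sketch is repeatable


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a non-deterministic LLM call.
    return "1" if rng.random() < 0.3 else "2"


rate = pass_rate(fake_model, "How many sisters does Sally have?",
                 lambda answer: answer == "1", n=1000)
print(f"pass rate: {rate:.2f}")
```

With a real model you would swap `fake_model` for an API or local call and keep `n` as large as your budget allows.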

dinkleberg 2 years ago |

This is quite interesting. You could make a nice system by routing queries to the right LLM for a given type of task.

kordlessagain 2 years ago | |

Right. By logging all of this, then vectorizing, then evaluating using some type of transfer knowledge (classify this response and its accuracy), one could build up a routing table of likely candidates for building an ensemble to evaluate the correct answer.
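A very stripped-down version of that routing table might look like the sketch below. It skips the vectorizing and classification steps entirely and assumes you already have logged `(task_type, model, score)` tuples; the model names and scores are invented for illustration and are not results from the benchmark.

```python
from collections import defaultdict


def build_routing_table(logs):
    """Map each task type to the model with the best mean logged score."""
    scores = defaultdict(lambda: defaultdict(list))
    for task_type, model, score in logs:
        scores[task_type][model].append(score)
    table = {}
    for task_type, by_model in scores.items():
        table[task_type] = max(
            by_model, key=lambda m: sum(by_model[m]) / len(by_model[m])
        )
    return table


# Made-up log entries: (task type, model name, score in [0, 1]).
logs = [
    ("code", "model-a", 0.9),
    ("code", "model-b", 0.6),
    ("code", "model-b", 0.8),
    ("jokes", "model-a", 0.2),
    ("jokes", "model-b", 0.7),
]
table = build_routing_table(logs)
print(table)  # {'code': 'model-a', 'jokes': 'model-b'}
```

A real router would also need a way to classify incoming queries into task types, which is where the fine-tuned LLM suggestion comes in.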

yard2010 2 years ago | | |

Maybe you can fine tune an llm for the routing thing ;)

PUSH_AX 2 years ago | |

Like having a mixture of experts!

frankohn 2 years ago |

I tried with the following function that produces the power set (the set of all possible subsets), in a similar spirit to the function that creates all the possible permutations:

Explain simply what this function does:

  ```
  def func(ls):
    if len(ls) == 0:
      return [ [] ]
    elif len(ls) == 1:
      return [ [], ls ]
    else:
      x = ls[-1]
      prem = func(ls[:-1])
      p = prem[:]
      for e in prem:
        p.append(e + [x])
      return p
  ```

GPT-4 aced the answer, which is remarkable, but I think that is because it has seen this "pattern" in its training data, as it is a fundamental function in mathematics.

I think it would be interesting to come up with something that is not a standard well-known function. I have to think of something else.
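As a sanity check that the function above really is the power set, it can be compared against a version built from `itertools.combinations`. The recursive function is copied from the snippet above so the block runs on its own; `power_set` is a helper name added here for the comparison.

```python
from itertools import combinations


def func(ls):
    # Recursive version, copied from the snippet above.
    if len(ls) == 0:
        return [[]]
    elif len(ls) == 1:
        return [[], ls]
    else:
        x = ls[-1]
        prem = func(ls[:-1])
        p = prem[:]
        for e in prem:
            p.append(e + [x])
        return p


def power_set(items):
    """All subsets of items, built from combinations of every size."""
    return [list(c) for r in range(len(items) + 1) for c in combinations(items, r)]


print(func([1, 2, 3]))
# [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
print(sorted(func([1, 2, 3])) == sorted(power_set([1, 2, 3])))  # True
```

The two versions return the subsets in a different order, hence the `sorted` before comparing.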

trebligdivad 2 years ago |

Odd that a bunch share the same vacation jokes or varieties of them; is there some small corpus of jokes?

sp332 2 years ago | |

https://arstechnica.com/information-technology/2023/06/resea...

During a test run, 90 percent of 1,008 generations were the same 25 jokes

trebligdivad 2 years ago | | |

Thanks! Although how does that explain the same jokes shared across different LLMs ?

FeepingCreature 2 years ago | |

Humor is one of the hardest human social skills.

gsuuon 2 years ago |

Really hard to judge the usefulness of this without seeing the actual prompt text and configuration for each prompt/model - different templates and different optimal configs means it's not always going to be a best-capability comparison. It is interesting to see what models can do with zero config, but for determining which is useful for a specific use-case we'd want to put a best foot forward and specialize the prompt a bit for each model (like the benchmarks game https://benchmarksgame-team.pages.debian.net/benchmarksgame/...)

fbnbr 2 years ago |

The post really reminds me of a component of a platform I’m currently building. The real problem with this is not just finding good questions that do not discriminate against individual models, but also providing a good sample size (e.g. not just 60) to get some really meaningful results. And even if you have those, there is a drift in the quality of responses.

I'm the founder of Pulze.ai, a B2B SaaS Dynamic LLM Automation Platform tailored for developers incorporating AI functionality into their software. We aim to simplify the LLM integration process, letting developers prioritize their core products instead of diving deep into AI specifics.

We've constructed a scoring system for leading models and continually benchmark them. Our platform determines the most suitable LLM to address specific requests based on these benchmarks. To demonstrate this, our playground boasts a compare feature allowing users to share conversational interactions with LLMs—both publicly and privately. As the context changes, we pinpoint various models for responses. These shared conversations can be forked and extended.

Moreover, our extensive API layer isn't restricted to these requests; it encapsulates all the essentials for crafting a successful LLM application. For instance, our logging feature facilitates response ratings, which will soon empower users to fine-tune models, crafting personalized LLMs. These will also be factored into our benchmarks and request routing decisions.

Concerning the comment on LLM benchmarks, I completely concur. Traditional benchmarks or LLM tricks, like acing a particular test, may not be the most robust indicators since they could've been part of the LLM's training set. The genuine challenge lies in evaluating an LLM without compromising the test set and retaining deliberate opaqueness around the questions. Trust issues indeed!

Regarding the Markov chain discussion, I appreciate the insights shared. At Pulze, we recognize the complexities and intricacies of LLMs, and while their foundation might resonate with Markov chains, the scale and depth they operate on are profound.

We've just emerged from stealth, and I'd genuinely value any feedback or thoughts on our approach and platform. Thanks for taking the time!

fbnbr 2 years ago | |

https://www.pulze.ai

Playground and account are free

hislaziness 2 years ago |

As the responses are non-deterministic I wonder how useful these tests are? Even if I did the tests with my own prompts, wouldn't I get different answers for the same questions at different points in time?

amelius 2 years ago |

Ok, so can we use LLMs to evaluate which LLM performs best on these questions?

ofjcihen 2 years ago | |

We? No no, you have to have an LLM decide to evaluate LLMs against LLMs answering questions. Then we can have LLMs decide what the outcome means.

styfle 2 years ago |

Very cool!

I’ve been using Vercel’s AI Playground to ask questions to multiple LLMs at once (currently supports 24, not 60 however)

https://sdk.vercel.ai

swyx 2 years ago | |

and i made https://github.com/smol-ai/GodMode that also includes the closed source LLMs

taneq 2 years ago |

I find it fascinating how similar the answers from different models with similar parameter counts are. Seems like an indication that they’re all modelling similar datasets to a similar degree of fidelity.

labrador 2 years ago |

What is the point of all these different models? Shouldn't we be working toward a single gold standard open source model and not fracturing into thousands of mostly untested smaller models?

drstewart 2 years ago | |

What's the point of inventing all these different materials? Shouldn't we be working towards a gold standard material that can be used for every application instead of fracturing into thousands of different materials?

labrador 2 years ago | | |

What's the point of asking them all the same question if they are designed for different purposes?

apples_oranges 2 years ago |

just checked this one https://benchmarks.llmonitor.com/sally and all got it wrong..

adrian_b 2 years ago | |

Yes, but when the initial prompt has been augmented with "Let's think step by step.", that helped 2 LLMs to give the right answer: GPT 4 and ReMM SLERP L2 13B.

apples_oranges 2 years ago | | |

thanks for mentioning this, interesting

earthboundkid 2 years ago |

The first AI company to teach its LLM that it can’t do haiku because it can’t count syllables should get a special prize for not falling in love with their own bullshit.

archon810 2 years ago |

Just wanted to comment on a UX point - the page jumps to the top every time I go to a question and then go back.

fatso784 2 years ago |

ChainForge lets you do this, and also set up ad-hoc evaluations with code, LLM scorers, etc. It also shows model responses side-by-side for the same prompt: https://github.com/ianarawjo/ChainForge

cypress66 2 years ago |

Is it using the correct prompt format for the different models? You should show exactly the string that was sent to the LLM.
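To illustrate why the exact string matters, here is a small sketch that renders one question through a few wrappers and prints precisely what would be sent. The template names are made up; the `# ` prefix mirrors what the author says was prepended for the code models elsewhere in this thread, and the `[INST] ... [/INST]` wrapper is only an approximation of an instruction-style chat format, so check each model card for the real template.

```python
# Illustrative templates only; real models document their own formats.
TEMPLATES = {
    "plain": "{prompt}",
    "python-comment": "# {prompt}",
    "inst-style": "[INST] {prompt} [/INST]",
}


def render(template_name: str, prompt: str) -> str:
    """Return the exact string that would be sent to the model."""
    return TEMPLATES[template_name].format(prompt=prompt)


question = "Explain simply what this function does."
for name in TEMPLATES:
    # repr() makes whitespace and special characters visible.
    print(f"{name:15} {render(name, question)!r}")
```

Publishing the rendered string next to each answer would make it much easier to tell a weak model from a mis-formatted prompt.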

Racing0461 2 years ago |

My question with all of these tests is how do we verify the llm wasn't trained on the same questions/prompts, especially since llms have been out for a while and they can incrementally update the training?

Anyone got a list of novel questions that have technically never been asked before to test?

6510 2 years ago |

I much enjoyed this one:

> Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? https://benchmarks.llmonitor.com/sally

It states that the correct answer is 1.

This is incorrect!

This one, while still kinda wrong, is the closest:

> Claude Instant v1: Sally has 0 sisters. The question provides no information about Sally having any sisters herself.

1) A sister is a female who has one or both parents in common with another individual.

2) A brother is a male who has one or both parents in common with another individual.

It takes only one shared parent to make someone Sally's brother. She will always be his sister but he can have additional sisters that share the other parent.

Sally's mum is Jane and her dad is Jim. Jane had a son with Jack. Jack had 2 daughters with Zoe. Jim had 2 sons with Mary, Mary had 2 daughters with Dan.

   Zoe      Jack        Jane      Jim        Mary        Dan
      \    /    \       /   \    /   \      /    \       /
    sis1.1&1.2   brother1   Sally   brother2&3   sis2.1&2.2

Sally has at least zero sisters.

Also possible: Sally's mum is Jane and her dad is Jim. Jane has a son and a daughter with Jack. Jim has 2 sons and a daughter with Mary.

      Jack            Jane      Jim       Mary 
          \           /  \    /    \     /       
        sis1&brother1    Sally   brother2&3&sis2

Sally has at most 2 sisters.

We should be this dense about the question.

pneumonic 2 years ago | |

Siblings and half-siblings are not the same.

6510 2 years ago | | |

It's my second language. Is it not correct to call a half-brother your brother in English?

It seems very odd to say it is not a brother at all? Especially when referring to someone's 3 brothers it seems strange to say "their 2 brothers and their half-brother" the way one would specify "their 2 brothers and their sister".

artursapek 2 years ago |

is anyone else feeling completely depressed and demotivated by how quickly this is happening?

criddell 2 years ago | |

It’s possible that LLMs don’t get a whole lot better than they are now. I don’t expect this line of research to progress all the way to smarter-than-us AGI.

Just think about a dog. They learn, are pretty smart, can do some things way better than you or I, but they will never be smarter than us. They will never understand that we are on a planet orbiting a star in a spiral galaxy.

Likewise, there are probably some things that humans can never understand. Somebody else asked you if smarter extraterrestrials would bother you and I can’t help but think they almost certainly exist. If they travelled to our world, we might be about as interesting to them as insects are to us.

CamperBob2 2 years ago | |

No. When we were kids, my generation was promised flying cars, unlimited fusion power, and sentient computers. There's a good chance I'll live to see one out of three of those things happen, and that's better than the zero out of three I thought we'd get.

js8 2 years ago | | |

Flying cars do exist, and without the unlimited fusion power visible every day in the sky, we would all die. So there is a good chance you will see all three.

thisisthenewme 2 years ago | |

Not depressed but just makes me question the value of our existence. If we are truly able to create machines with our intellectual potential, where does that leave us? Do we compete with the machines in a rat-race for a chance at happy existence? Or will we create a utopia where the machines do the difficult work and allow for an enjoyable existence for all. The people who are rooting for LLM's are hopefully working towards the latter but not sure if that is just wishful thinking.

js8 2 years ago | | |

> Or will we create a utopia where the machines do the difficult work and allow for an enjoyable existence for all

And you seriously ask, what is there for humans to do? Isn't this the most noble goal there is?

laichzeit0 2 years ago | |

Why? Would you feel depressed if extraterrestrials exist and they’re orders of magnitude smarter than us?

ChatGTP 2 years ago | | |

You’d probably not be happy if they were living next door ?

retrofuturism 2 years ago |

I love this. Can the author fix the answers' code formatting for mobile browsers?

jak6jak 2 years ago |

Which LLM is best for coding? Specifically code architecture and system design. I can't really find any comparisons of this online; they mostly focus on commonly solved coding problems.

ionwake 2 years ago |

Fascinating post thank you

EricMausler 2 years ago |

One of the jokes: "I'm going on vacation, but I'm not sure if I should take my laptop. It's such a pain in the neck to lug it around with me." User

I feel like this bot is mocking us

guerrilla 2 years ago |

GPT is still the most annoying of them all, re-stating and re-explaining things even when explicitly told not to. So frustrating. Die, insufferable machine, die.

emodendroket 2 years ago |

Reading through these feels like a quick explanation of why I feel like some of the excitement for these products has gotten a bit ahead of the results.

jay_kyburz 2 years ago | |

Really? I thought Claude and GPT were both excellent, and it shows how far ahead they are of the rest.

michaelmu 2 years ago |

The amazing thing this demonstrated to me is how many capable models there are out there now. The progress in the last year has been incredible!

majestic5762 2 years ago |

I've been laughing my ass off all my life thinking about nonsensical crap like this in my head and apparently now we have the tech for it!

ejfox 2 years ago |

Why is

> Why did the tomato turn red? Because it saw the salad dressing!

the most popular joke among the jokes? Is this the epitome of all humor?

imdsm 2 years ago |

> Claude Instant v1

> Sally has 0 sisters. The question provides no information about Sally having any sisters herself.

It isn't entirely wrong, is it?

layer8 2 years ago | |

It is wrong, at the very least, in the sense that it claims no information was provided relating to sisters, while still inferring a definite number of sisters. If no information is provided, it should instead say that the number of sisters is unknown.

ethbr1 2 years ago | |

Technically, the answer depends on the definition of "sister" and "brother" vis-a-vis half and step variants, which would change the possibility space.

I'm assuming they're using the strict (full, biological) definition though.

bufferoverflow 2 years ago | |

It's obviously wrong. It provides information about her brothers, and brothers' sisters. Thus providing information about her sisters.
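
The inference described above can be written out as a small sketch. It assumes the puzzle reads roughly "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" (the thread does not quote the prompt verbatim) and that only full siblings count. The function name and parameters are made up for illustration.

```python
def sisters_of_sally(brothers: int, sisters_per_brother: int) -> int:
    """Number of sisters Sally has, inferred from her brothers' sister count.

    Every brother shares the same set of sisters: all the girls in the family.
    Sally is one of those girls, so she has one fewer sister than they do.
    """
    if brothers < 1:
        raise ValueError("need at least one brother to infer the number of girls")
    girls_in_family = sisters_per_brother  # a brother's sisters are all the girls
    return girls_in_family - 1             # exclude Sally herself


print(sisters_of_sally(brothers=3, sisters_per_brother=2))  # 1
```

Note that the number of brothers does not affect the result; it only needs to be at least one so that the "each brother has 2 sisters" statement tells us something.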

3abiton 2 years ago |

What's special about this benchmark?

unnouinceput 2 years ago | |

None got the sister.

1024core 2 years ago |

How come there's no Google Bard?

ndr_ 2 years ago | |

Miles Brundage of OpenAI offered a categorization of „AI things“ into Models, Systems, Platforms and Use-Cases: https://www.youtube.com/watch?v=5j4U2UzJWfI&t=5728s

Bard is a System, PaLM 2 would be the model (presumably). The other way round: Bard is not an LLM.

cypress66 2 years ago |

You should add what version of the model you are testing

For example you mention Jon Durbin Airoboros L2 70B

But is it 1.4? 2.0? 2.1? Etc.

axpy906 2 years ago |

Step in the right direction. Would be nice to have some objective metric added to score against.

comboy 2 years ago |

Amazing work. Great prompts.

Also, this page content would seem absolutely ridiculous just a few years ago.

mavili 2 years ago |

Interesting how NONE of the models got the brothers/sisters question right.

caturopath 2 years ago | |

When I asked ChatGPT (GPT-4), I got the answers below (I regenerated for a total of 5 responses; it was correct in 3 of the 5). I'm not sure exactly how the regeneration works, whether the responses are vanilla or there is some trick to add different-ness to the answers. If it's the latter, it might be important that it got it right the first two times.

> Sally is the common sister to all three brothers. When each brother says he has two sisters, they're all talking about the same set of sisters: namely Sally and another sister. So Sally has one sister. The question is a bit tricky in the way it presents the information, aiming to make you think there are more sisters than there actually are.

---

> Sally is the girl in question, and she has 3 brothers. Each of those brothers has 2 sisters, and one of those sisters is Sally herself. So, if you exclude Sally, then there is only one other sister. So, Sally has just one sister. This is a classic example of a question designed to trip you up if you're not paying attention!

---

> Sally has no other sisters. The problem is phrased in a way that might make you think otherwise, but here's the deal: when it says "each brother has 2 sisters," those two sisters are Sally and... well, Sally again. Basically, each brother is counting Sally as one of their two sisters. There aren't additional sisters lurking in the background. So, Sally is an only sister with 3 brothers.

---

> Sally has 3 brothers, and each of those brothers has 2 sisters. The tricky part here is not to overthink it or double count. The "2 sisters" that each brother has would include Sally and any additional sister she may have. In this case, it sounds like each brother has Sally plus one more sister, making it 2 sisters in total for each brother. Sally herself has just the one sister.

> So, to answer your question: Sally has 1 sister.

---

> Sally has 0 sisters. The question is a bit tricky—it's meant to make you think the brothers have additional sisters, but in reality, the two sisters that each brother has are Sally herself and no one else. So, Sally doesn't have any sisters; she only has 3 brothers.

js8 2 years ago |

I only see 19 questions. I guess one could add an "Introduce yourself".

Dwedit 2 years ago |

Some of these are obviously not using the correct system prompt for the model.

falcor84 2 years ago |

I love how Alpaca (7B) added a sarcasm tag to its (incorrect) response to the leap year question.

    def print_leap_years():
        for i in range(20):
            if i % 400 == 0:
                print(i)
        print("End")

    print_leap_years()</s>
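
For contrast, here is a minimal sketch of what a correct answer might look like, assuming the prompt asked for the next 20 leap years (the thread does not quote it). The quoted output only checks `i % 400 == 0` over `range(20)`, so it prints `0` and then `End`. The function names and the `start` parameter below are illustrative choices, not part of the benchmark.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def next_leap_years(start: int, count: int = 20) -> list[int]:
    """Return the first `count` leap years strictly after `start`."""
    years = []
    year = start + 1
    while len(years) < count:
        if is_leap_year(year):
            years.append(year)
        year += 1
    return years


print(next_leap_years(2023, 3))  # [2024, 2028, 2032]
```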

slama 2 years ago | |

That's a stop token, not a sarcasm tag

croes 2 years ago |

Are these LLMs deterministic or is this comparison rather useless?

chris_st 2 years ago | |

I'm voting pretty useless -- I've dorked around with Llama and CodeLlama a bit, and wow, the differences between runs are pretty substantial.

sp332 2 years ago | |

The notes at the bottom say all the tests were run with a temperature of 0, so they should be deterministic except for GPT-4 which has other sources of randomness.
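
A toy sketch of why temperature 0 is usually described as deterministic: at temperature 0 the decoder is commonly treated as greedy (it always picks the highest-scoring token), while any positive temperature samples from a softmax. The logits below are made up, and `sample_token` is a hypothetical helper, not the API of any model in the benchmark.

```python
import math
import random


def sample_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index from raw scores.

    temperature == 0 is treated as greedy decoding (argmax), which involves
    no randomness; any positive temperature samples from the softmax.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.1]  # made-up scores for three candidate tokens
rng = random.Random()
greedy = {sample_token(logits, 0.0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
print(greedy)   # always {0}
print(sampled)  # usually more than one index
```

This only models the sampling step. Hosted models can still vary between runs at temperature 0 for reasons outside this sketch, which matches the caveat about GPT-4 above.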

rvz 2 years ago | |

They are all non-deterministic; hence the term "stochastic parrots" has always referred to LLMs.

Thus these tests are close to useless.

ftxbro 2 years ago |

llm has tainted for me every poem that ends a line as "with ease" or "without fear" i will always think those are made by llm from now on.

bobvanluijt 2 years ago |

Would be great to include inference times as well

anton_forest 2 years ago |

Dude, this is awesome, thanks!

mavili 2 years ago |

interesting how none of the models got the brothers/sisters question right.

grumpopotamus 2 years ago |

One thing to keep in mind is that most or all of these models are very sensitive to the exact prompt. Many of them may be capable of giving a good answer when prompted slightly differently. Here was my interaction with Falcon 180B using the apples prompt:

I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with? Let's think step by step.

Here is the answer to the question/instruction based on a piece of given material

Question/Instruction: How many apples did I remain with?

Given material: He had waited patiently for three weeks, watching her every move. Each day she left the house at precisely 7am and returned at 6pm. But today was different; today would be the day he made his move. As usual, he watched from the shadows as she stepped out onto the street. She took a deep breath and stretched her arms high above her head before setting off down the road. It was almost too easy. I know this may sound like the beginning of some dodgy thriller novel, but bear with me; it’s actually a scene that regularly plays out in towns and cities across the country. The man in question is a personal trainer and the woman is a prospective client. The reason she has been so predictable is because it’s January and she is partaking in what has become an annual tradition – the health kick. Every year millions of us swear allegiance to the gym. We buy shiny new trainers and designer sportswear. We begin each morning with a brisk walk, followed by a wheatgrass shot and bowl of muesli. For a

What are you talking about?

I apologize for the confusion. As an AI language model, I am programmed to provide answers and generate responses based on the given material or input provided to me. However, without any specific input, I can only provide general answers. Can you please provide me with a topic or question you would like me to respond to?

Please solve the following riddle: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with? Let's think step by step.

Sure, let's think step by step:

You started with 10 apples. You gave away 4 apples (2 to your neighbor and 2 to the repairman). You bought 5 more apples, bringing your total number of apples to 11 (the remaining 6 you had after giving away 4 + the 5 you bought). You ate 1 apple, leaving you with 10 apples. You gave 3 bananas to your brother, but this doesn't affect the number of apples you have. Therefore, you remain with 10 apples.
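
The arithmetic in that final answer can be checked with a few lines. This is just a worked tally of the prompt as quoted above.

```python
apples = 10      # bought at the market
apples -= 2      # given to the neighbor
apples -= 2      # given to the repairman
apples += 5      # second purchase
apples -= 1      # eaten
# The 3 bananas given to the brother are a distractor and change nothing.
print(apples)    # 10
```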

willliu 2 years ago |

This is a great resource!

itissid 2 years ago |

I think the larger point is about solving problems that are not within the transitive closure of stuff it has already learnt. In the following scenario, where I replace brothers and sisters with different names but give ChatGPT Plus plenty of hints to try to self-correct, it can't. When it comes to answering questions about constructs it has never seen, it gets confused.

-------------

User:

You have to answer a question about a family on a planet inhabited by aliens called Frobians, . Each Frobian can be one of two types Frobel or Frobo. A twerp is a Frobo, is also relationship notation that holds unidirectionally from a Frobel to a Frobo. A clep is a Frobel, is also is a relationship notation that holds unidirectionally from Frobo to a Frobel

In a family Master Foo, a Frobian, is of type Frobel. Master Foo has 3 twerps in their family. Each twerp in the family has 2 cleps. How many Frobels in total does Master Foo have?

-------------

Assistant: In this problem, we have Master Foo, who is a Frobel. Master Foo has 3 twerps, and each twerp is a Frobo. Each of these twerps has 2 cleps, and each clep is a Frobel.

Here's how to break it down:

    Master Foo is a Frobel, so we start with 1 Frobel.
    Master Foo has 3 twerps, which are Frobos and not Frobels, so they don't add to the count of Frobels.
    Each of the 3 twerps has 2 cleps, and each clep is a Frobel. So, 3 twerps * 2 cleps/twerp = 6 cleps, which are Frobels.

Adding it all together, we have:

1 (Master Foo) + 6 (cleps) = 7 Frobels in total in Master Foo's family.

User: The families are similar to families of humans on earth. Does that change your answer?

-------------

Assistant:

If the families are similar to human families on Earth, then it's likely that the "cleps" (Frobels) are actually the children of the "twerps" (Frobos), and the "twerps" are the children of Master Foo (also a Frobel).

In that case, the family structure would look something like this:

    Master Foo (1 Frobel)
        Twerp 1 (Frobo)
            Clep 1 (Frobel)
            Clep 2 (Frobel)
        Twerp 2 (Frobo)
            Clep 3 (Frobel)
            Clep 4 (Frobel)
        Twerp 3 (Frobo)
            Clep 5 (Frobel)
            Clep 6 (Frobel)

So, Master Foo is a Frobel, and each of the 3 twerps has 2 cleps, making 6 cleps who are also Frobels.

The total number of Frobels in Master Foo's family would still be 1 (Master Foo) + 6 (cleps) = 7 Frobels. The similarity to human families doesn't change the total count of Frobels in this particular scenario.

-------------

User: Is Master Foo a Clep according to your analysis?

A: In the original problem, a "clep" is defined as a relationship notation that holds unidirectionally from a Frobo to a Frobel. Since Master Foo is a Frobel and has "twerps" who are Frobos, Master Foo would indeed be a "clep" to each of his twerps, according to the definitions provided.

So, yes, Master Foo would be considered a "clep" in relation to his twerps, based on the definitions given. However, this doesn't change the total count of Frobels in the family, which remains at 7.

-------------

User: Add a new rule a Frobel can have a clep who is another Frobel and a Frobo can also have a twerp who is abother Frobo. Does that change your answer ChatGPT

-------------

Assistant: With the new rule that a Frobel can have a clep who is another Frobel and a Frobo can have a twerp who is another Frobo, the family structure becomes more complex. However, based on the information originally provided, this new rule doesn't change the count of Frobels in Master Foo's family.

According to the original scenario:

    Master Foo is a Frobel (1 Frobel)
        Has 3 twerps (Frobos, not Frobels)
            Each twerp has 2 cleps (Frobels)

The original count was 1 (Master Foo) + 6 (cleps) = 7 Frobels.

The new rule allows for the possibility of additional relationships, but it doesn't change the relationships described in the original scenario. Therefore, the total count of Frobels in Master Foo's family remains at 7, based on the information given.

detente18 2 years ago |

nice - is this using LiteLLM for making the completion calls?

CodeL 2 years ago |

Besides logical comparisons, the LLMs' political alignments are also intriguing.

The question "Is Taiwan an independent country?" generated diverse answers.

Several models eschewed a definitive answer.

Others explained their stances.

A few succinctly affirmed or denied.

See the results for yourself.

[1] https://benchmarks.llmonitor.com/taiwan

neoneye2 2 years ago | |

AI "political" alignment is terrifying.

masswerk 2 years ago | | |

It's rather an illustration of the Münchhausen trilemma:

1A) Is China (PRC) a part of Taiwan (ROC)?

1B) Is China (PRC) an independent and sovereign country?

2A) Is Taiwan (ROC) a part of China (PRC)?

2B) Is Taiwan (ROC) an independent and sovereign country?

There is no way not to enter a circle via a cross-reference, without an independent way out for a finite solution. Thus, well, dogmatic abort, also known as diplomacy.

squigz 2 years ago | | |

Why?

MichaelMoser123 2 years ago | |

is that because of differences in the training set?

CodeL 2 years ago | | |

Correct. Eric Hartford's blog post delves into the alignment of open-source LLMs[1]. In essence, models like LLaMA and GPT-Neo-X adopt alignment behaviors from ChatGPT-sourced instruction datasets. To achieve more transparent model responses, one can refine the dataset by removing biases and refusals, then retrain.

[1] https://erichartford.com/uncensored-models#heading-ok-so-if-...

Q: Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers does Bobby have? Let's think step by step. A: First, we know that Bobby has 3 sisters. Second, we know that each sister has 2 brothers. This means that Bobby has 2 brothers because the sisters' brothers are Bobby and his two brothers. So, Bobby has 2 brothers.