Re-Evaluating GPT-4's Bar Exam Performance

Re-Evaluating GPT-4's Bar Exam Performance (link.springer.com)

122 points by rogerkeays 2 years ago | 137 comments

Scoring in the 96th percentile among humans taking the exam, without moving the goalposts, would have been science fiction two years ago. Now it’s suddenly not good enough, and the fact that a computer program can score decently among passing lawyers and first-time test takers is something to sneer at.

The fact I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking. Anyone who views it as anything less in 2024 and asserts with a straight face they wouldn’t have said the same thing in 2020 is lying.

I do, however, find the paper really useful in contextualizing the scoring at a much finer grain. Personally, I didn’t take the 96th percentile score to be anything other than “among the mass who take the test,” and I have enough experience with professional licensing exams to know that a huge percentage of test takers fail and are repeat test takers. Placing the goalposts quantitatively for the next levels of achievement is a useful exercise. But the profusion of jaded nerds makes me sad.

d0mine 2 years ago | |

On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Are we sure these exams are not present in the training data? (ability to recall information is not impressive for a computer)

Still, I'm terrible at many, many tasks (e.g., drawing from a description), and the models significantly widen the types of problems that I can even attempt (where results can be verified easily and no precision is required).

munchler 2 years ago | | |

> On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

That's probably true, which is why most human knowledge workers aren't going away any time soon.

That said, I have better luck with a different approach: I use LLMs to learn things that I don't already understand well. This forces me to actively understand and validate the output, rather than consume it passively. With an LLM, I can easily ask questions, drill down, and try different ideas, like I'm working with a tutor. I find this to be much more effective than traditional learning techniques alone (e.g. textbooks, videos, blog posts, etc.).

taberiand 2 years ago | | |

It depends on the topic (and the LLM - ChatGPT-4 equivalent at least, any model equivalent to 3.5 or earlier is just a toy in comparison) - but I've had plenty of success using it as a productivity enhancing tool for programming and AWS infrastructure, both to generate very useful code and as an alternative to Google for finding answers or at least a direction to answers. But I only use it where I'm confident I can vet the answers it provides.

randomtoast 2 years ago | | |

> On any topic that I understand well, LLM output is garbage

I've heard that claim many times, but never is there any specific follow-up on which topics they mean. Of course, there are areas like math and programming where LLMs might not perform as well as a senior programmer or mathematician, sometimes producing programs that do not compile or incorrect calculations/ideas. However, this isn't exactly "garbage" as some suggest. At worst, it's more like a freshman-level answer, and at best, it can be a perfectly valid and correct response.

mordymoop 2 years ago | | |

On what topics you understand well does GPT-4o or Claude Opus produce garbage?

WhitneyLand 2 years ago | | |

I suspect by garbage you mean not perfect.

To be more precise can you please give a topic you know well and your % guess how often the answers are wrong on the topic?

mistrial9 2 years ago | | |

the models that you have tried .. are garbage. hmmm Maybe you are not among the many, many, many inside professionals and uniformed services that have different access than you? money talks?

aurareturn 2 years ago | | |

>On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Is it generally because the LLM was not trained on that data and therefore has no knowledge of it, or because it can't reason well enough?

dragonwriter 2 years ago | |

The real problem is that tests used for humans are calibrated based on the way different human abilities correlate: they aren't objectives themselves, they are convenient proxies.

But they aren't meaningful for anything other than humans since the correlations between abilities which make them reasonable proxies are not the same.

The idea that these kinds of test results prove anything (other than the utility of the tested LLM for humans cheating on the exam) is only valid if you assume not only that the LLM is actually an AGI, but that it's an AGI that is indistinguishable, psychometrically, from a human.

(Which makes a nice circular argument, since these test results are often cited to prove that the LLMs are, or are approaching, AGI.)

derbOac 2 years ago | | |

This is a good point.

I've noticed one thing that LLMs seem to have trouble with is going "off task".

There are often very structured evaluation scenarios, with a structured set of items and possible responses (even if defined in an abstract sense). Performance in those settings is often ok to excellent, but when the test scenario changes, the LLM seems to not be able to recognize it, or fails miserably.

The Obama pictures were a good example of that. Humans could recognize what was going on when the task frame changed, but the AI started to fail miserably.

Me and my friends, similarly, often trick LLMs in interactive tasks by starting to go "off script", where the "script" is some assumption that we're acting in good faith with regard to the task. My guess is humans would have a "WTF?" response, or start to recognize what was happening, but an LLM does not.

In the human realm there's an extra-test world, like you're saying, but for the LLM there's always a test world, and nothing more.

If I'm being honest with myself, my guess is a lot of these gaps will be filled over the next decade or so, but there will always be some model boundaries, defined not by the data used to estimate the model, but by the framework the model exists within.

mattgreenrocks 2 years ago | |

I have difficulty being optimistic about LLMs because they don’t benefit my work now, and I don’t see a way that they enhance our humanity. They’re explicitly pitched as something that should eat all sorts of jobs.

The problem isn’t the LLMs per se, it’s what we want to do with them. And, being human, it becomes difficult to separate the two.

Also, they seem to attract people who get real aggressive about defending them and seem to attach part of their identity onto them, which is weird.

seizethecheese 2 years ago | |

By 96th percentile do you mean 69th? From the abstract:

> data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and 48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be 62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to 48th percentile overall, and 15th percentile on essays.
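
The drop from one figure to the next in that abstract comes from changing the comparison group, not the score. A toy sketch of the effect in Python follows; the scores and the cut score of 270 are invented for illustration and are not taken from the paper or from NCBE data.

```python
from bisect import bisect_left


def percentile_rank(score, population):
    """Percent of the population scoring strictly below `score`."""
    ordered = sorted(population)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical scaled scores for ten test takers (made-up numbers).
all_takers = [240, 250, 255, 260, 265, 270, 275, 280, 290, 300]

# Hypothetical cut score: only takers at or above it count as "passers".
CUT_SCORE = 270
passers = [s for s in all_takers if s >= CUT_SCORE]

model_score = 278  # hypothetical score being ranked

print(percentile_rank(model_score, all_takers))  # rank among everyone
print(percentile_rank(model_score, passers))     # rank among passers only
```

With these invented numbers the same score ranks at 70.0 among all takers and 40.0 among passers, because removing the low scorers from the comparison group removes most of the people it beat.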

QuantumGood 2 years ago | |

It scored less than 50% when compared to people who had taken the test once.

Workaccount2 2 years ago | |

The nerds aren't jaded, they are worried. I'd be too if my job needed nothing more than a keyboard to be completed. There are a lot of people here who need to squeeze another 20-40 years out of a keyboard job.

imtringued 2 years ago | | |

You're assuming that keyboard jobs are easier to automate simply because the models were built to output text, but nothing prevents physical motion from being easier simply due to sheer repetitiveness. In fact, you can get away with building dedicated robots, e.g. for drywall spraying and sanding, whereas the keyboard guys tend to have to switch tasks all the time.

threeseed 2 years ago | | |

Similar comments were made that microwaves would eliminate cooking.

At the end of the day (a) LLMs aren't accurate enough for many use cases and (b) there is far more to knowledge workers' jobs than simply generating text.

keefle 2 years ago | |

The profusion of jaded nerds, although saddening at times, seems to be pushing science forward. I have a feeling that a prolonged sense of "Awe" can hinder progress at times, and the lack of it is usually a sign of the adaptability of a group (how quickly are new developments normalized?)

api 2 years ago | |

It’s the hype. We could invent warp drive but if it was hyped as the cure for cancer, poverty, war, and the gateway to untold riches and immortality while simultaneously being the most dangerous invention in history destined to completely destroy humanity people would be “oh ho hum we made it to Centauri in a week” pretty fast.

Add some obnoxious pseudo-intellectual windbags building a cult around it and people would be downright turned off.

Hype is also taken as a strong contrarian indicator by most scientific and engineering types. A lot of hype means it’s snake oil. This heuristic is actually correct more often than it’s not, but it is occasionally wrong.

viking123 2 years ago | |

Yeah it’s insane, I am actually scared the LLM is like sentient and secretly plotting to kill me. I bet we have like full AGI next year because Elon said so and Sam Altman probably has AGI already internally at OpenAI. I am actually selling my house now and going all in on Nvidia and just living in my car until we get the AGI

iLoveOncall 2 years ago | |

> The fact I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking

That's called a programming language. It's nothing new.

fooker 2 years ago | | |

It's a programming language except the programming part, and the language part.

thehoneybadger 2 years ago |

It is difficult to comment without sounding obnoxious, but having taken the bar exam, I found the exam simple. Surprisingly simple. I think it was the single most overhyped experience of my life. I was fed all this insecurity and walked into the convention center expecting to participate in the biggest intellectual challenge of my life. Instead, it was endless multiple choice questions and a couple contrived scenarios for essays.

It may also be surprising to some to understand that legal writing is prized for its degree of formalism. It aims to remove all connotation from a message so as to minimize misunderstanding, much like clean code.

It may also be surprising, but the goal when writing a legal brief or judicial opinion is not to try to sound smart. The goal is to be clear, objective, and thereby, persuasive. Using big words for the sake of using big words, using rare words, using weasel words like "kind of" or "most of the time" or "many people are saying", writing poetically, being overly obtuse and abstract, these are things that get your law school application rejected, your brief ridiculed, and your bar exam failed.

The simpler your communication, the more formulaic, the better. The more your argument is structured, akin to a computer program, the better.

Compared to some other domain, such as fiction, good legal writing is much easier for an attention model to simulate. The best exam answers are the ones that are the most formulaic, that use the smallest lexicon, and that use words correctly.

I only want to add this comment because I want to inform how non-lawyers perceive the bar exam. Getting an attention model to pass the bar exam is a low bar. It is not some great technical feat. A programmer can practically write a semantic disambiguation algorithm for legal writing from scratch with moderate effort.

It will be a good accomplishment, but it will only be a stepping stone. I am still waiting for AI to tackle messages that have greater nuance and that are truly free form. LLMs are still not there yet.

radford-neal 2 years ago |

A basic problem with evaluations like these is that the test is designed to discriminate between humans who would make good lawyers and humans who would not make good lawyers. The test is not necessarily any good at telling whether a non-human would make a good lawyer, since it will not test anything that pretty much all humans know, but non-humans may not.

For example, I doubt that it asks whether, for a person of average wealth and income, a $1000 fine is a more or less severe punishment than a month in jail.

elicksaur 2 years ago |

> Furthermore, unlike its documentation for the other exams it tested (OpenAI 2023b, p. 25), OpenAI’s technical report provides no direct citation for how the UBE percentile was computed, creating further uncertainty over both the original source and validity of the 90th percentile claim.

This is the part that bothered me (licensed attorney) from the start. If it scores this high, where are the receipts? I’m sure OpenAI has the social capital to coordinate with the National Conference of Bar Examiners to have a GPT “sit” for a simulated bar exam.

Suppafly 2 years ago | |

>This is the part that bothered me (licensed attorney) from the start. If it scores this high, where are the receipts?

I'm not a licensed attorney, but that's also bothered me about all of these sorts of stories. There is never any proof provided for any of the claims, and the behavior often contradicts what can be observed using the system yourself. I also assume they cook the books a little by having included a bunch of bar-exam-specific training when creating the model in the first place, specifically so it does better on bar exams than in general.

dogmayor 2 years ago |

The bigger issue here is that actual legal practice looks nothing like the bar, so whether or not an llm passes says nothing about how llms will impact the legal field.

Passing the bar should not be understood to mean "can successfully perform legal tasks."

Bromeo 2 years ago |

Very interesting. The abstract says that although GPT-4 was reported to score in the 90th percentile on the bar exam, correcting for several factors shows that result to be inflated, and that it scores only in the 15th percentile on essays when compared only to people who passed the bar.

That still does put it into bar-passing territory, though, since it still scores better than about one sixth of the people that passed the exam.

falcor84 2 years ago | |

If I understand correctly, they measured it at the 69th percentile for the full test across all test takers, so definitely still impressive.

gnicholas 2 years ago |

This analysis touches on the difference between first-time takers and repeat takers. I recall when I took the bar in 2007, there was a guy blogging about the experience. He went to a so-so school and failed the bar. My friends and I, who had been following his blog, checked in occasionally to see if he ever passed. After something like a dozen attempts, he did. Every one of us who passed was counted in the pass statistics once. He was counted a dozen times. This dramatically skews the statistics, and if you want to look at who becomes a lawyer (especially one at a big firm or company), you really need to limit yourself to those who pass on the first (or maybe second) try.
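
A back-of-the-envelope sketch of that skew, in Python. The cohort is entirely hypothetical (nine people who pass on the first sitting plus one who passes on the twelfth); it only illustrates the counting effect and says nothing about real bar exam data.

```python
# Hypothetical cohort: number of sittings each person needed before passing.
# Nine people pass on their first attempt; one passes on attempt twelve.
attempts_per_person = [1] * 9 + [12]

# Per-sitting view: every attempt is one row in the exam statistics.
total_sittings = sum(attempts_per_person)
passing_sittings = len(attempts_per_person)  # each person passes exactly once
per_sitting_pass_rate = passing_sittings / total_sittings

# Per-person view: what share of people passed on their first try?
first_timers = sum(1 for a in attempts_per_person if a == 1)
first_time_pass_rate = first_timers / len(attempts_per_person)

# Share of all sittings contributed by the single repeat taker.
repeater_share = max(attempts_per_person) / total_sittings

print(f"sittings: {total_sittings}")
print(f"pass rate per sitting: {per_sitting_pass_rate:.1%}")
print(f"first-time pass rate per person: {first_time_pass_rate:.1%}")
print(f"sittings from the one repeat taker: {repeater_share:.1%}")
```

In this made-up cohort one person out of ten accounts for more than half of the 21 sittings, so any percentile computed over sittings rather than people is weighted heavily toward the repeat taker's scores.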

jeffbee 2 years ago |

It appears that researchers and commentators are totally missing the application of LLMs to law, and to other areas of professional practice. A generic trained-on-Quora LLM is going to be straight garbage for any specialization, but one that is trained on the contents of the law library will be utterly brilliant for assisting a practicing attorney. People pay serious money for legal indexes, cross-references, and research. An LLM is nothing but a machine-discovered compressed index of text. As an augmentation to existing law research practices, the right LLM will be extremely valuable.

violet13 2 years ago | |

It is a lossy compressed index. It has an approximate knowledge of law, and that approximation can be pretty good - but it doesn't know when it's outputting plausible but made-up claims. As with GitHub Copilot, it's probably going to be a mixed bag until we can overcome that, because spotting subtle but grave errors can be harder than writing something from scratch.

There's already a fair number of stories of LLMs used by an attorney messing up court filings - e.g., inventing fake case law.

jeffbee 2 years ago | | |

I am not suggesting that the generative aspects would be useful in drafting motions and such. I am suggesting that their tendency towards false results is harmless if you just use them as a complex index. For example, you could ask it to list appellate cases where one party argued such-and-such and prevailed. Then you would go read the cases.

Digory 2 years ago |

They originally scored against a test usually taken by people who failed the bar.

So, GPT-4 scores closer to the bottom of people who pass the bar the first time. In other words, it matches the people who cull the rules from texts already written, but who cannot apply it imaginatively.

speedgoose 2 years ago | |

> In other words, it matches the people who cull the rules from texts already written, but who cannot apply it imaginatively.

Where did you find that in the article?

Digory 2 years ago | | |

If you can recite the black letter law, you've got a good chance of passing the bar. The higher essay scores usually require creative arguments about resolving competing rules and policies.

It's easier to extract the formal statement of the rule against perpetuities from a reddit corpus, than to apply the rule to an artificially complex fact pattern in an essay question.

_fw 2 years ago |

So it knows more about the law than you do, but less than they do.

Really glad to see research replicated like this. I’m not surprised that the 90th percentile doesn’t hold up.

It’s still handy though.

lccerina 2 years ago |

It's amazing the level of mental gymnastics I see in the comments trying to justify a piece of technology that is evidently not as good as they believed it to be...

Slyfox33 2 years ago | |

AI is just the next tech hype scam after crypto and NFTs.