Re-Evaluating GPT-4's Bar Exam Performance (link.springer.com)
The fact I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking. Anyone who views it as anything less in 2024 and asserts with a straight face they wouldn’t have said the same thing in 2020 is lying.
I do however find the paper really useful in contextualizing the scoring with a much finer grain. Personally I didn’t take the 96th percentile score to be anything other than “among the mass who take the test,” and have enough experience with professional licensing exams to know a huge percentage of test takers fail and are repeat test takers. Placing the goal posts quantitatively for the next levels of achievement is a useful exercise. But the profusion of jaded nerds makes me sad.
Are we sure these exams are not present in the training data? (ability to recall information is not impressive for a computer)
Still, I'm terrible at many, many tasks (e.g., drawing from a description), and the models significantly widen the types of problems that I can even try (where results can be verified easily and no precision is required).
That's probably true, which is why most human knowledge workers aren't going away any time soon.
That said, I have better luck with a different approach: I use LLMs to learn things that I don't already understand well. This forces me to actively understand and validate the output, rather than consume it passively. With an LLM, I can easily ask questions, drill down, and try different ideas, like I'm working with a tutor. I find this to be much more effective than traditional learning techniques alone (e.g. textbooks, videos, blog posts, etc.).
I've heard that claim many times, but never is there any specific follow-up on which topics they mean. Of course, there are areas like math and programming where LLMs might not perform as well as a senior programmer or mathematician, sometimes producing programs that do not compile or incorrect calculations/ideas. However, this isn't exactly "garbage" as some suggest. At worst, it's more like a freshman-level answer, and at best, it can be a perfectly valid and correct response.
To be more precise can you please give a topic you know well and your % guess how often the answers are wrong on the topic?
Is it generally because the LLM was not trained on that data and therefore has no knowledge of it, or because it can't reason well enough?
But they aren't meaningful for anything other than humans since the correlations between abilities which make them reasonable proxies are not the same.
The idea that these kinds of test results prove anything (other than the utility of the tested LLM for humans cheating on the exam) is only valid if you assume not only that the LLM is actually an AGI, but that it's an AGI that is indistinguishable, psychometrically, from a human.
(Which makes a nice circular argument, since these test results are often cited to prove that the LLMs are, or are approaching, AGI.)
I've noticed one thing that LLMs seem to have trouble with is going "off task".
There are often very structured evaluation scenarios, with a structured set of items and possible responses (even if defined in an abstract sense). Performance in those settings is often ok to excellent, but when the test scenario changes, the LLM seems to not be able to recognize it, or fails miserably.
The Obama pictures were a good example of that. Humans could recognize what was going on when the task frame changed, but the AI started to fail miserably.
My friends and I, similarly, often trick LLMs in interactive tasks by starting to go "off script", where the "script" is some assumption that we're acting in good faith with regard to the task. My guess is humans would have a "WTF?" response, or start to recognize what was happening, but an LLM does not.
In the human realm there's an extra-test world, like you're saying, but for the LLM there's always a test world, and nothing more.
If I'm being honest with myself, my guess is a lot of these gaps will be filled over the next decade or so, but there will always be some model boundaries, defined not by the data used to estimate the model, but by the framework the model exists within.
The problem isn’t the LLMs per se, it’s what we want to do with them. And, being human, it becomes difficult to separate the two.
Also, they seem to attract people who get real aggressive about defending them and seem to attach part of their identity onto them, which is weird.
> data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and 48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be 62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to 48th percentile overall, and 15th percentile on essays.
At the end of the day (a) LLMs aren't accurate enough for many use cases and (b) there is far more to knowledge workers' jobs than simply generating text.
Add some obnoxious pseudo-intellectual windbags building a cult around it and people would be downright turned off.
Hype is also taken as a strong contrarian indicator by most scientific and engineering types. A lot of hype means it’s snake oil. This heuristic is actually correct more often than it’s not, but it is occasionally wrong.
That's called a programming language. It's nothing new.
It may also be surprising to some to understand that legal writing is prized for its degree of formalism. It aims to remove all connotation from a message so as to minimize misunderstanding, much like clean code.
It may also be surprising, but the goal when writing a legal brief or judicial opinion is not to try to sound smart. The goal is to be clear, objective, and thereby, persuasive. Using big words for the sake of using big words, using rare words, using weasel words like "kind of" or "most of the time" or "many people are saying", writing poetically, being overly obtuse and abstract, these are things that get your law school application rejected, your brief ridiculed, and your bar exam failed.
The simpler your communication, the more formulaic, the better. The more your argument is structured, akin to a computer program, the better.
Compared to some other domain, such as fiction, good legal writing is much easier for an attention model to simulate. The best exam answers are the ones that are the most formulaic, that use the smallest lexicon, and that use words correctly.
I only want to add this comment because I want to inform how non-lawyers perceive the bar exam. Getting an attention model to pass the bar exam is a low bar. It is not some great technical feat. A programmer can practically write a semantic disambiguation algorithm for legal writing from scratch with moderate effort.
It will be a good accomplishment, but it will only be a stepping stone. I am still waiting for AI to tackle messages that have greater nuance and that are truly free form. LLMs are still not there yet.
For example, I doubt that it asks whether, for a person of average wealth and income, a $1000 fine is a more or less severe punishment than a month in jail.
This is the part that bothered me (licensed attorney) from the start. If it scores this high, where are the receipts? I’m sure OpenAI has the social capital to coordinate with the National Conference of Bar Examiners to have a GPT “sit” for a simulated bar exam.
I'm not a licensed attorney, but that's also bothered me about all of these sorts of stories. There is never any proof provided for any of the claims, and the behavior often contradicts what can be observed using the system yourself. I also assume they cook the books a little by having included a bunch of bar-exam-specific training when creating the model in the first place, specifically to do better on bar exams than in general.
Passing the bar should not be understood to mean "can successfully perform legal tasks."
That still does put it into bar-passing territory, though, since it still scores better than about one sixth of the people that passed the exam.
There's already a fair number of stories of LLMs used by an attorney messing up court filings - e.g., inventing fake case law.
So, GPT-4 scores closer to the bottom of people who pass the bar the first time. In other words, it matches the people who cull the rules from texts already written, but who cannot apply them imaginatively.
Where did you find that in the article?
It's easier to extract the formal statement of the rule against perpetuities from a Reddit corpus than to apply the rule to an artificially complex fact pattern in an essay question.
Really glad to see research replicated like this. I’m not surprised that the 90th percentile doesn’t hold up.
It’s still handy though.
So maybe it's easy if you study that stuff for a year or two. But you can't just walk in and expect to pass, or bullshit your way through it.
I agree with you on legal writing, but there appears to be a certain amount of ambiguity inherent to language. The Uniform Commercial Code, for instance, is maddeningly vague at points.
Also, sometimes sample exams are made extra difficult, to convince students that they need to shell out thousands of dollars for prep courses. I recall getting 75% of questions wrong on some sections of a bar prep company's pre-test, which I later realized was designed to emphasize unintuitive/little-known exceptions to general rules. These corners of the law made up a disproportionate number of the questions on the pre-test and gave the impression that the student really needed to work on that subject.
A key item which jumped out at me right away is that in addition to the logic, the possible answers would include things which the scenario didn't address. For instance, a wrong answer might make an assumption that you couldn't arrive at via the scenario. Trickier were the answers which made assumptions which you knew to be correct (based on a real event) but which still weren't addressed in the scenario. If you combined these two elements (getting the logic right, and eliminating assumptions which you couldn't make from the scenario) then you could do well on those.
The sections I wouldn't have passed were those which required specific law knowledge. So, some sections were general, while others required knowledge of something like real estate law. I don't remember if these questions were otherwise similar to the ones I could pass.
An LLM is taking this test as essentially an open book.
Keep in mind that even today[1] (in California and a few other states) you don't need to go to law school to write the bar exam and practice law; various forms of apprenticeship under a judge or lawyer are allowed.
You also don't need to write the exam to practice many aspects of the legal profession.
The exam was never meant to be a high bar of quality or selection; it was always just a simple validation that you know your basics. Law, like many other professions, has always operated on reputation and networks, not on degrees and certifications.
[1] Unlike say being a doctor, you have to go to med school without exception
> The more your argument is structured, akin to a computer program, the better.
You certainly make legal writing sound like a flavor of technical writing. Simplicity, clarity, structure. Is this an accurate comparison?
"For a person of average wealth and income, a $1000 fine is generally less severe than a month in jail. A month in jail entails loss of freedom, potential loss of employment, and social stigma, while a $1000 fine, though financially burdensome, does not affect one's freedom or ability to work" --ChatGPT 4o
Where is that coming from? That's a very lawyery way to phrase things.
"Potential"? Where I live, I think people might max out their holidays, overtime (if lucky enough) and unpaid leave, but there would be a conversation with your employer to justify it and to work out how to handle the workload.
In the USA, from what I read, it's more than likely that you would just be fired on the spot, right?
edit: just googled a bit, where I live you must tell your employer why you will be absent if you go to jail but that can't be used to justify the breaking of the contract unless the reason for the incarceration is damaging to the company and... yeah, I am definitely not a lawyer :]
Would be cool to know how LLMs shape their opinions.
- Memorization requires you to retain the details of a large amount of material
- The most time-efficient analysis uses instant-recall of relevant general themes to guide research
- Ergo, if someone can memorize and recall a large number of details, they can probably also recall relevant general themes, and therefore quickly perform quality analysis
(Side note: memorization also proves you actually read the material in the first place)
Nobody does except a bunch of HNers who among other things, apparently have no idea that a considerable chunk of rulings and opinions in the US federal court system and upper state courts are drafted by law clerks who, ahem, have not taken the bar yet...
The point of the bar and MPRE is like the point of most professional examinations: try to establish minimum standards. That said, the bar does test for "successfully perform legal tasks", actually.
For the US bar, a chunk of your score is based on following instructions on a case from the lead attorney, and another chunk is based on essay answers. That means literally demonstrating that you can perform legal tasks and have both the knowledge and critical thinking skills necessary.
Further, as previously mentioned, in the US, people usually take it after a clerkship...where they've been receiving extensive training and experience in practical application of law.
Further, law firms do not hire purely based on your bar score. They also look at your grades, what programs you participated in (many law schools run legal clinics to help give students some practical experience, under supervision), your recommendations, who you clerked for, etc. When you're hired, you're under supervision by more senior attorneys as you gain experience.
There's also the MPRE, or ethics test - which involves answering how to handle theoretical scenarios you would find yourself in as a practicing attorney.
Multiple people in this discussion are acting like it's a multiple choice test and if you pass, you're given a pat on the ass and the next day you roll into criminal court and become lead on a murder case...
The difference between expectation and reality is tripping people up in both directions — a nearly-free everything-intern is still very useful, but to treat LLMs* as experts (or capable of meaningful on-the-job learning if you're not fine-tuning the model) is a mistake.
* special purpose AI like Stockfish, however, should be treated as experts
They basically say two things. First, although the measurement is repeatable at face value, there are several factors that make it less impressive than assumed, and the model performs fairly poorly compared to likely prospective lawyers. Second, there are a number of reasons why the percentile on the test doesn't measure lawyering skills.
One of the other interesting points they bring up is that there is no incentive for humans to seek scores much above passing on the test, because your career outlook doesn't depend on it in any way. This is different from many other placement exams.
Edit: LLMs' biggest feat is being a natural language interpreter, so they can run natural language scripts. They are far from perfect at it, but that is still programming.
The essay questions also test memorization. They don’t require any difficult analysis - just superficial issue-spotting and reciting the correct elements.
If the bar exam were not a memorization test, it would be open book!
Well, in a lot of the so-called soft sciences, you can easily beat a test without subject knowledge. I had figured that the bar exam might be something like that -- but it's more akin to something like biology, where there are a lot of arcane and counterintuitive little rules that have emerged over time. And you need to know those, or you're sunk. You can't guess your way past them, because the best-looking guesses tend to be the wrong ones.
(For what it's worth, I realize that this mostly has to do with the Common Law's reverence of precedent-as-binding, and that continental Civil Law systems don't suffer as much from it. But I suppose those continental systems have other problems of their own.)
Laws are not logical constructs, they are political constructs, so why expect logic from them?
Laws are passed or repealed because it is popular or politically advantageous to do so, not necessarily because it is moral or common sense.
On top of that, politicians pass stupid legislation without understanding what they are doing all the time. The infamous Indiana pi bill is a simple example: it almost became law and was stopped by sheer luck, thanks to a mathematics professor attending that day on an unrelated matter.
Laws are conflicting, confusing, ambiguous and misleading most of the time. That is expected, since legislating them is a messy process. The sole purpose of the judiciary, the third arm of any government, is to handle this mess.
---
P.S. I cannot say whether continental legal systems are more robust, but perhaps healthier democracy results in better laws.
All democracies are flawed, and U.S. democracy is not particularly healthy. It is not, say, a proportional multi-party representative system with a fair distribution of power amongst all citizens.
From the founding it has been a series of compromises. The history is littered with fights over representation: suffrage, Jim Crow and voting rights, slavery, the electoral college, the number of states, the filibuster, anti-Chinese laws, and so on.
Don't get me wrong, the last 250 years have seen incredible progress, and great leaders put their lives down to make it better. Hopefully it will be even better in the future, but the struggles do show in the laws it is able to pass, repeal or update.
Reminds me of the mandatory trainings you take for work every year. Normal logical thinking can get you through most of them.
A human that can digest the general law can also digest a special case, but that isn't true for an LLM.
And even though legal practice tends to be fairly slow and deliberative, there are settings (such as trial advocacy) where there is a real advantage to being able to cite a case or statute from memory.
All that said, I still maintain that it’s a poor way to compare humans with machines, for the same reason it would be poor to compare GPT-4 to a novelist on their tokens per second written.
That's still cognate with the concept of computer code.
It is like a calculator that only worked on one digit and now works on two. The improvement is immense, but it's still nowhere close to replacing mathematicians, since it isn't even working on the same kind of problems.
Edit: In several years we might have a perfect calculator that is better than any human at such tasks, but it still won't beat humans at stuff unrelated to calculations. Or, in the case of LLMs pattern matching texts: humans don't pattern match texts to plan or mentally simulate scenarios etc., and that part isn't covered by LLMs. Human-level planning with today's LLM-level pattern matching on text would be really useful. We see a lot of humans work that way by using the LLM as a pattern matcher, but there is no progress on automating human-level planning so far. LLMs aren't it.
GPT-3.5 was released in March 2022. We are now in June 2024. Over 2 years later.
And on average GPT-4 is about 40% more accurate.
For me, LLMs are very much like self-driving cars. On the journey towards perfect accuracy it gets progressively harder to make advancements.
And for it to replace the status quo it really does need to be perfect. And there is no evidence or research that this is possible.
People don't want to hear that, but you see fewer and fewer job offers, and not only for junior positions.
The hard truth is that, as with any tool or automation, the more performance improves, the fewer people are needed for this kind of work.
Just look at how some parts of manual labor were made redundant.
Why people think it won't be the same with mental work is beyond me.
That is garbage.
> I hope you don't hold a teaching position at a university then.
You think teachers shouldn't have a growth mindset for students? I think students can grow from producing garbage answers to good answers; that is what they are there for. An LLM, however, doesn't grow, so while such students are worth teaching even though they produce garbage answers, the LLM isn't.
But a common situation is that with code generation it will fail to understand the context of where the code belongs and so it's a function that will compile but makes no sense.
Try getting it to properly select crystalloids with proper additives for a patient with a given history and lab results and watch in horror as it confidently gives instructions that would kill the patient.
What is even more irritating is that I had GPT-4 debate me on things that it was completely wrong about, and it was only when I responded with a stern rebuke that it hit me with the usual "Apologies for the misunderstanding..."
Well written textbooks are consumable on their own for some people, but most are not written for that.
Automatic proof generation is a massive open problem in all of computer science and not close to being solved. It’s true LLMs aren’t great at it and more is required, for example as with the geometry system DeepMind is making progress on.
On the other hand they can be very useful to explain concepts and allow interactive questioning to drill down and help build understanding of complex mathematical concepts, all during a morning commute via the voice interface.
Yesterday I was looking for some help on an issue with the unshare command; it repeatedly made bad assumptions about the nature of the error even though I provided it with the full error message, and one could already guess the initial cause by looking at that.
I guess such errors can be frighteningly common once you get outside of typical web development.
E.g. I had it autocompleting a set of 20 variables today, something like output.blah=tostring(input[blah]). The kind of work you give to a regex.
In the middle of the list, it decides to go output.blah=some long weird piece of code, completely unexpected and syntactically invalid.
I am still in my AI evaluation phase, and sometimes I am impressed with what it does. But just as possible is an unexpected total failure. As long as it does that, I can't trust it.
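For reference, the "work you give to a regex" mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the field names other than "blah" and the helper name expand_fields are made up, and the generated lines simply copy the output.blah=tostring(input[blah]) shape from the comment.

```python
import re

# Hypothetical field names; the comment above only ever names "blah".
FIELDS = ["blah", "foo", "bar"]


def expand_fields(names):
    """Turn a list of bare identifiers into one assignment line per name.

    Each identifier sits on its own line, and a single regex substitution
    rewrites every line into the output.X=tostring(input[X]) form.
    """
    text = "\n".join(names)
    return re.sub(
        r"^(\w+)$",                          # a line holding one identifier
        r"output.\1=tostring(input[\1])",    # reuse the identifier twice
        text,
        flags=re.MULTILINE,                  # let ^ and $ match per line
    )


if __name__ == "__main__":
    print(expand_fields(FIELDS))
```

The point of doing it this way is that the substitution is deterministic: every line gets the same treatment, so there is no chance of an unexpected expression appearing halfway through the list.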
I think many students, including freshmen, have interesting and sometimes thought-provoking ideas. And they come up with creative solutions, which are based on their previous experience in life. I would never describe that as garbage.
ADDED: You're probably going to end up lying or at least being very vague "some family stuff to take care of" in this specific scenario but for one month that didn't trigger reporting to employer a lot of professionals could probably get off with it. In any case, the GPT answer seems totally correct for the parameters given.
Where I live that is grounds for cancellation of the work contract: you can't lie about why you are absent, though the imprisonment by itself can't be cause for laying you off.
GPT-4o:
“Average wealth and income” can vary significantly by region and context. However, in the United States, as a rough benchmark, the median household income is around $70,000 per year. Wealth, which includes assets such as savings, property, and investments minus debts, is harder to pinpoint but median net worth for U.S. households is approximately $100,000. These figures provide a general idea of what might be considered “average” in terms of wealth and income.
Btw I'm not personally a lawyer, but I've heard that GPT is especially prone to mixing laws across borders. For example, you ask a legal question in language X and get a response that uses a law from country Y, and it's extremely convincing when doing that (unless you're a lawyer, I guess).
https://en.wikipedia.org/wiki/List_of_countries_by_English-s...
I know there are a lot of complaints about things being US-centric, but the US is a very large country.
But my question will not be part of the context of that conversation.
Where I live you must inform your employer, unless you are lucky enough to have enough PTO/vacation days to spend them in prison and hide it from your employer. But even then I don't think that will work out, because there is administrative stuff related to social welfare that you have to comply with, and at some point it will be on your employer's radar anyway (why is that guy exempt from social welfare taxes for that specific month? and why did the police ask me to confirm he was working here? etc.).
BUT in practice, until recently, if imprisonment is less than 3 years then you won't spend a day in prison (unless you are deemed too dangerous). But then you have an electronic bracelet and other obligations that will at some point alert your employer.
I read now that prison time of any length will have to be spent in prison (no more pass for sentences of less than 3 years, 2 years, or 18 months). But for prison time of less than 18 months you can have a bracelet from the first day.
All that to say I don't think it's manageable and possible to hide prison time from your employer, even if you have enough PTO/vac. days to cover for it.
So much of what a CEO does is fostering culture, hiring people and setting a unique vision for the company.
Imagine thinking people would be inspired to work for a chatbot. Hilariously ridiculous.
I dunno, I would probably prefer to work under that chatbot than my current CEO, who only tries to squeeze as much as possible out of the people already working for him.
No one who has been a CEO, or frankly even worked closely with one, would think this could be even remotely close to possible. Or desirable if it was.
But that is probably 1% or less of the population eh?
Seems your claim's been disproven already
OK, I’ll bite. What’s your evidence for this argument?
They’re plausible word sequence generators, not ‘planning for the future’ agents. Or market analyzers. Or character evaluators. Or anything else.
And they tend to be really ‘gullible’.
What evidence do you have they could do any of those things? (And not just generate plausible text at a prompt, but actually do those things)
Every bit of interaction I’ve ever had with an LLM.
But there's a scarier further step: When people assume an exceptional text-specialist model can also meta-impersonate a generalist model impersonating a specific and different kind of specialist! ("LLM, create a legal defense.")
Though of course OpenAI can tell (frequently, roughly) where folks are coming from geographically and could (does?) take that into account.
Indeed - the US is a very large country, and consists of over 50 different jurisdictions, each with their own slightly different laws. An answer to a legal question which is correct in one state will often be subtly incorrect in another, and completely wrong in yet another.
Medical problems are highly contextual, so you are not going to get much valuable information at the level of what a doctor is thinking from the first page of Google. That doesn't mean it isn't simple within our area of expertise.
But it makes for a fun soundbite eh? Especially when the article claims it was in the past, and totally was awesome. Sucker born every minute.
At least a human CEO has to worry about being arrested or someone setting their house on fire.