AI outperforms law professors in Stanford Law study

I find this study quite suspect. I'd have to dive deeper, but there are definitely significant alarm bells that should be going off for anyone reading.

Figure 2 (page 6) screams problems. There are only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol

There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?

I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over
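To make the rater-count concern concrete, here is a rough sketch of how clustering by rater can widen the uncertainty around a win rate. It uses the figures quoted from the press release elsewhere in this thread (nearly 3,000 comparisons, a 75% win rate) and the 16 professors mentioned above. The even split across raters and the intraclass correlation of 0.10 are assumptions picked purely for illustration, not values from the paper, and the function names are mine.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion; n may be a non-integer effective sample size."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

TOTAL_COMPARISONS = 3000  # "nearly 3,000" per the quoted press release
N_RATERS = 16             # professors, per the comment above
WIN_RATE = 0.75           # quoted head-to-head win rate
ASSUMED_ICC = 0.10        # ASSUMPTION: illustrative only, not taken from the paper

per_rater = TOTAL_COMPARISONS / N_RATERS  # ASSUMPTION: comparisons split evenly
deff = design_effect(per_rater, ASSUMED_ICC)
n_eff = TOTAL_COMPARISONS / deff          # effective number of independent comparisons

naive = wilson_interval(WIN_RATE, TOTAL_COMPARISONS)
clustered = wilson_interval(WIN_RATE, n_eff)

print(f"comparisons per rater: {per_rater:.1f}")
print(f"design effect: {deff:.2f}, effective n: {n_eff:.1f}")
print(f"naive 95% interval:     {naive[0]:.3f} to {naive[1]:.3f}")
print(f"clustered 95% interval: {clustered[0]:.3f} to {clustered[1]:.3f}")
```

The point is only that ratings from the same professor are not independent, so the effective sample size can be far smaller than the raw comparison count. Whether that changes the conclusion depends on the real between-rater correlation, which this sketch does not know.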

In many (most?) countries you can defend yourself and waive your court-appointed attorney. You are of course strongly discouraged from doing so. But sometimes people do it, mostly for smaller claims where they don't want to rack up legal bills that might cost more than what is at stake.

But it makes me wonder: will clients be able to use these AI-attorney systems in court in the future? They would basically either just parrot what the model instructs them to say, or - I dunno - give the model permission to speak for them (while waiving liabilities).

I have no doubt that some complex AI system can perform better than a bottom-tier, overworked lawyer.

As a software engineer I have some intuition for what the risks are of letting agents do some tasks vs others.

I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, e.g. drafting a will, but I don't really know - our legal system is notoriously rife with footguns.

Tangential: is there a "test suite/CI" for AI writing legal documents? A long time ago, in terms of AI progress, a lawyer filed something with hallucinated sources. Do new tools prevent this?
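For what it's worth, one piece of such a "CI" could be sketched as a citation check: pull reporter-style citations out of a draft and flag any that are not in a trusted index. Everything here is an assumption for illustration: the regex only covers a few U.S. reporter formats, `KNOWN_CITATIONS` is a hypothetical stand-in for a real citator database, and the flagged citation in the sample draft is deliberately made up. It says nothing about what current tools actually do.

```python
import re

# ASSUMPTION: a deliberately narrow pattern covering a few U.S. reporter formats.
CITATION_RE = re.compile(
    r"\b\d{1,3}\s+(?:U\.S\.|S\. Ct\.|F\.(?:2d|3d|4th)|F\. Supp\.(?: 2d| 3d)?)\s+\d{1,4}\b"
)

# Hypothetical stand-in for a verified citation index.
KNOWN_CITATIONS = {
    "347 U.S. 483",  # Brown v. Board of Education
    "410 U.S. 113",  # Roe v. Wade
}

def extract_citations(text):
    """Return reporter-style citations found in text, with whitespace normalized."""
    return [" ".join(m.group(0).split()) for m in CITATION_RE.finditer(text)]

def unverified_citations(text, known=KNOWN_CITATIONS):
    """Return citations that do not appear in the trusted index."""
    return [c for c in extract_citations(text) if c not in known]

draft = (
    "See Brown v. Board of Education, 347 U.S. 483 (1954), and "
    "Smith v. Jones, 999 F.3d 9999 (2d Cir. 2021)."  # made-up citation
)

flagged = unverified_citations(draft)
print(flagged)
```

A check like this only catches citations that do not exist at all. It would not catch a real case cited for something it never said, which still needs a human (or a much better tool) to read the source.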

I understand why the conversation on this article looks like it does, but the study is specifically focused on the potential for LLMs to operate as tutors for law students. I enjoy the extrapolation out to whether LLMs will replace lawyers, but did not find that to be discussed in the study itself.

In the framing of using LLMs as legal tutors, with the implication of lowering the cost of legal training, this seems like a socially-positive outcome. Furthermore, it feels kind of intuitive to me that any contemporary system operating with an LLM and access to legal reference material will be prepared to answer _student-originated questions_ comprehensively and with breadcrumbs or direct references to educational/source materials, as seems to have been found in the study.

The authors explicitly and intentionally emphasize that many legal questions require contextualization, as opposed to some discrete calculated answer. The result of the study implies that the LLM-based systems were capable of using what many of us here understand to be the "stochastic best-fit algorithmic generation" of a contemporary language model to adequately contextualize a student's question, providing insight into the trade-offs or complications implicit in the question, while then, critically, _meeting the professional standards of legal educators in explaining that complexity to a student_.

Realistically, I would hope this provides some confidence to readers of HN that they can actually ask a legal question to an LLM and expect the response will explain the complexity of the law in relation to the question. This is great news, and is likely the minimal pre-work any of us should do before actually consulting a lawyer, if time permits.

On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel. Possibly in the same way that a legal textbook does not replace legal counsel, or perhaps more accurately, the same way that stumbling upon a legal case study for approximately the same situation you're in doesn't guarantee you'll have the same result.

I'm surprised Stanford Law would go along with this over-reaching press release title. How about "For common first-year contracts-law questions, law professors preferred AI-generated answers to professor-generated answers"

I'd read this less as "AI replaces law professors" and more as "AI may be a surprisingly strong first-pass tutor, especially when the student knows enough to question it"

My best guess is that Gemini was trained on the textbooks that the questions are meant to test against, thus they are probably better at explicit recall of those questions or related questions.

This is a pretty limited introductory course based on what it says in the methods of the paper itself.

What the LLM cannot do is explain why it said what it said, when cross-examined. It simply hallucinates the best account of why someone would have said such a thing as it said, same as it can give a probable account of why someone else said something different. The question 'But why did you say this not that ...?' does not lead it to make explicit its grounds for what it said, but just to make a new more complicated statement.

I do question at what point AI could be useful as a teaching aid.

The quality of LLM answers depends heavily on, among other things, how you word your questions.

Knowing the correct questions to ask is not something most students know how to do given that it tends to require a fair bit of pre-existing domain knowledge.

It is important for society to understand that it is not merely programmers and customer support who are at risk of losing their jobs. Clearly AI can do much more than just program.

Oh, a "Human-Centered" study by an AI lover:

Julian Nyarko

    Professor of Law
    Co-Chair Stanford Law AI Initiative
    Senior Fellow, Stanford Institute for Human-Centered AI (HAI)

LOL!

16 is such a small number for what they phrase as an important finding. It really couldn't have been much harder to coordinate with 100+ professors.

One way to make legal services more affordable and accessible would be to put the burden of ensuring the AI legal services are accurate on a private-public partnership with the government.

If a person using the service is given inaccurate legal advice and acts on that advice, the person can't be charged with a crime, can't be given any civil penalties, etc., as long as the law in question is non-obvious.

Obviously if by some exploit, some fundamentally obvious crime (murder, theft, obvious fraud, etc.) is said to be legal, that wouldn't apply, but of course the service should try to prevent those kinds of exploits anyway.

Could limit this to something like business regulations to begin with, or even specifically for small businesses, or contracts within some time limit and dollar amount that would otherwise be coverable by small claims court, etc.

When I see news pieces like this I wonder about the failures. Maybe the failure percentage is low but what happens if a bot gives bad counseling? Who is responsible then?

Attorneys will be using LLMs for convenience, but they will not disappear, because ultimately there needs to be a human responsible for the decisions.

Curious how they do a “blind” preference test. To any evaluator I’m sure it’s quite clear which answer is AI vs human.

Honestly it's not surprising that AI provided answers that were flagged less often as "pedagogically harmful" if we take into account that LLMs somehow create an "average" of all the knowledge they ingested.

I'm going to need some legal help for my startup. But I can't pay much. So I figured I will ask AI all the relevant questions, have it fill out forms, etc. Perhaps even have it create a patent application for me.

THEN I find a human lawyer and give AI's answers to them and say "Can you find any errors in this? Can you improve it?"

That way I think my legal bills should be smaller because the AI has already done most of the work. What do you think? Which LLM is best for legal work?

In the hands of a domain expert, AI is useful. In the hands of the naive, it is a foot gun.

I killed my Arch installation and was stuck at the GRUB prompt. Unwilling to brush up my rusty knowledge of GRUB syntax, I asked Gemini for help. The commands Gemini suggested would have wiped my hd...

Once Gemini was told that I was using BTRFS, the suggestion from Gemini looked a bit more sane, but still looked incorrect to me.

It was only after I informed Gemini that I was using an NVMe drive with BTRFS that it finally produced a sane command.

I beat lawyers twice before generative AI even existed. Recently I asked Gemini a few questions about personal conflicts in everyday life. It's often too conservative, with views too shallow for the problem. So I still handle human conflicts myself. I only outsource the templated stuff like routine chat replies or marketing copy, though it saves me a huge amount of time. People who quote AI in serious conflicts are too weak to handle them on their own.

Yes, LLMs are great at search. That's not news.

> rated AI responses significantly higher than answers written by other professors, with AI winning 75% of head-to-head matchups.

That's the problem: you never know when the 25% will deliver a true stink bomb, and that's not even considering prompting. While a fair prompt/question may be considered objective, it's very easy to stray.

Question is: if a legal question is answered incorrectly by an LLM, who is going to be held responsible?

* Gemini 2.5 Pro (no outside resources), and
* NotebookLM (not versioned -- with added legal resources).

NotebookLM was considered slightly better than 2.5 Pro by the evaluators.

Yeah this could be interesting. A lot of the spotlight has been on “law firm stuff” like demand letters and writing contracts…

But imagine if a dev team didn’t have to go engineer -> product manager -> legal team to get a question answered on local data retention requirements. You could ship that much faster.

> In a blind evaluation of nearly 3,000 anonymized comparisons, professors rated AI responses significantly higher than answers written by other professors, with AI winning 75% of head-to-head matchups.

75% win rate seems pretty good!

Paper link: https://law.stanford.edu/wp-content/uploads/2026/06/salinas_...

I think there will be a market for firms that aggressively market themselves as non-AI, and then as more people turn towards that human connection we'll go full circle

This contradicts my anecdata.

Recently, I tasked Opus 4.6 with studying a new Czech building permit law in conjunction with some waste disposal regulations, and the result was disappointing. The model could not stop drawing conclusions from obsolete regulations in its training dataset, even when given the full text of the new law. The usual "you are totally right" also applied, and its conclusions were most of the time obviously wrong even to a human with cursory knowledge of the subject.

I ended up studying the relevant regulations myself over the weekend.

Incredible that the common people will be able to wrestle the right to rule of law away from the bloated legal caste, who have built themselves quite the moat.

The inaccessibility of justice is a huge driver of inequality. Any tools which bridge this gap will help make a more just society.

What is the point of this conclusion? That law professors like the tone and verbosity of AI slop? Okay?

Personally I think this is very good. One of the hardest things out there is maintaining a society in the face of changing times and it's because law is dense and slow.

I think, in the right hands, this could be huge.

While they provided the questions that professors and LLMs were asked to respond to, they don't include any of the answers from either the humans or the LLMs, so there's no way to independently verify that the LLMs actually returned "better" answers.

Given the number of responses the professors were asked to rate (200 each), they probably graded them the same way that bar exam responses are graded: quickly and superficially. Not surprising that LLMs achieved higher scores in this scenario, since they excel at producing superficially nice answers that don't hold up under scrutiny.

Also...unless statistics has changed in the past 2 decades, the math in the charts doesn't math. That's probably why they're leaving out the actual numerical data. I also wouldn't be surprised if we learn in the coming days that the charts were AI generated.

I skimmed portions of the study but didn't manage to figure out whether this actually measures a preference for confident mediocrity.

AI will never convince a jury though.

Library outperforms student... more news at 9

More great news from the prestigious university where 40% of students claim they are disabled

https://fortune.com/article/rise-in-elite-students-seeking-a...

and where they wanted to ban words such as "chief", "stupid", "karen" and "American"

https://reason.com/2022/12/21/stanford-elimination-harmful-l...

He is basically an AI professor for law. This study just confirms his existence:

https://juliannyarko.com/

Stanford and its donors of course want to replace anyone but its administrators, so they cheer on such anti-intellectual nonsense.

Yes yes, the IPO is near.

Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.

I'm getting more convinced. I mean, sure, it makes dumb mistakes sometimes, but it's a particular set of self-serving mistakes, like commenting out tests in order to pass. We obviously don't want this behavior, but I wouldn't say it's dumb.

It'll be like the Turing test, which we just blew past years ago and no one cared. After all the hand-wringing about sentience and rights of the AI if it passes the Turing test, and now we just have AI bots running 24/7 writing slop.

How does everyone else feel?

And this was done with Gemini 2.5

By the time any research study on AI is published, the models are already 0.5-1 generation ahead. Even this bullish outcome for AI models and their ability to perform useful work does not reflect how good they are now.

There is quite a simple solution for many of the problems described in the comments: Make drafting legal papers a defined interface.

If you think about it and extract the semantics of any law, you get something that looks familiar, sort of like code. Of course there are some complexities where certain phrases can mean different things, but legal papers in a way are written like they're programming languages already, especially when it comes to law.

First we would have to define a language that can handle ambiguous operations, and we already have this with programmatic proofs where n should land in x. So in the end I'd assume it would look something like this in a two-party dispute:

This is a very simplified, pseudocode-like language; writing out a full contract would be as long as a real contract.

     DEFINE DEFENDANT "A Corp"
     DEFINE PLAINTIFF "B Corp"
     DEFINE CONTRACT  CONTRACT(PLAINTIFF, DEFENDANT, 3054-41-95)

     // attaching extracted requirements, definitions and obligations of contract

     FACT   PLAINTIFF delivered(goods) ON 7054-34-99
     FACT   DEFENDANT paid(0) OF CONTRACT.amount

     CLAIM  breach WHEN obligation(DEFENDANT, "pay") IS NOT satisfied

     PROVE breach:
         REQUIRE  PLAINTIFF performed
         REQUIRE  DEFENDANT.paid < CONTRACT.amount
         ASSERT   delay WITHIN reasonable(time)

     IF PROVE(breach):
         AWARD PLAINTIFF (CONTRACT.amount - DEFENDANT.paid) + interest()
     ELSE:
         DISMISS

Then you would run a proof-based LLM to translate it into the target language, and since we already had an example of this from one of the AI labs, we know it works. Automatic citations and supporting proof would be automatically populated from reviewed legal -> DSL extracted papers as supporting evidence.

I am sure that many AI labs are working on something similar already and we will see something like that in the near future as proof-based LLMs evolve.
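As a rough illustration of what a target-language version of the pseudo-DSL above might look like, here is a minimal Python sketch of just the breach rule. The class and function names, the contract amount, and the flat `interest` argument are all hypothetical, and reading `ASSERT delay WITHIN reasonable(time)` as a plain boolean fact is my guess at the intent, not something the DSL defines.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Contract:
    plaintiff: str
    defendant: str
    amount: float

@dataclass(frozen=True)
class Facts:
    plaintiff_performed: bool           # FACT PLAINTIFF delivered(goods)
    defendant_paid: float               # FACT DEFENDANT paid(...)
    delay_within_reasonable_time: bool  # ASSUMPTION: ASSERT modeled as a boolean fact

def prove_breach(contract: Contract, facts: Facts) -> bool:
    """Mirror the PROVE block: all three conditions must hold."""
    return (
        facts.plaintiff_performed
        and facts.defendant_paid < contract.amount
        and facts.delay_within_reasonable_time
    )

def decide(contract: Contract, facts: Facts, interest: float = 0.0) -> Optional[float]:
    """Return the award if breach is proven, or None to represent DISMISS."""
    if prove_breach(contract, facts):
        return (contract.amount - facts.defendant_paid) + interest
    return None

# Hypothetical numbers, chosen only to exercise the rule.
contract = Contract(plaintiff="B Corp", defendant="A Corp", amount=1000.0)
facts = Facts(plaintiff_performed=True, defendant_paid=0.0, delay_within_reasonable_time=True)

award = decide(contract, facts, interest=50.0)
print(award)
```

The hard part stays outside the code: someone still has to decide whether each fact is true and what "reasonable" means, which is exactly where the ambiguity in real disputes lives.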