If a person using the service is given inaccurate legal advice and acts on that advice, the person can't be charged with a crime, can't be given any civil penalties, etc., as long as the law in question is non-obvious.
Obviously if by some exploit, some fundamentally obvious crime (murder, theft, obvious fraud, etc.) is said to be legal, that wouldn't apply, but of course the service should try to prevent those kinds of exploits anyway.
Could limit this to something like business regulations to begin with, or even specifically for small businesses, or contracts within some time limit and dollar amount that would otherwise be coverable by small claims court, etc.
The quality of LLMs depends heavily on, among other things, how you word your questions.
Knowing the correct questions to ask is not something most students know how to do given that it tends to require a fair bit of pre-existing domain knowledge.
Attorneys will use LLMs for convenience, but they will not disappear, because there ultimately needs to be a human responsible for the decisions.
Julian Nyarko
Professor of Law
Co-Chair Stanford Law AI Initiative
Senior Fellow, Stanford Institute for Human-Centered AI (HAI)
LOL! I killed my Arch installation and was stuck at the GRUB prompt. Unwilling to brush up on my rusty knowledge of GRUB syntax, I asked Gemini for help. The commands Gemini suggested would have wiped my hd...
Once Gemini was told that I was using BTRFS, the suggestion from Gemini looked a bit more sane, but still looked incorrect to me.
It was only after I informed Gemini that I was using an NVMe drive with BTRFS that it finally produced a sane command.
That's the problem, you never know when the 25% deliver a true stink bomb, and that's not considering prompting - while a fair prompt/question may be considered objective, it's very easy to stray.
This is a pretty limited introductory course based on what it says in the methods of the paper itself.
Figure 2 (page 6) screams problems. There's only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol
There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?
I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over
But, it makes me wonder, will clients be able to use these AI-attorney systems in the future, in the court. Where they basically either just parrot what the model is instructing them to do, or - I dunno - give the model permission to speak for them (while waiving liabilities).
I have no doubt that some complex AI system can perform better than a bottom-tier, overworked lawyer.
I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, e.g. drafting a will, but I don't really know; our legal system is notoriously rife with footguns.
In the framing of using LLMs as legal tutors, with the implication of lowering the cost of legal training, this seems like a socially-positive outcome. Furthermore, it feels kind of intuitive to me that any contemporary system operating with an LLM and access to legal reference material will be prepared to answer _student-originated questions_ comprehensively and with breadcrumbs or direct references to educational/source materials, as seems to have been found in the study.
The authors explicitly and intentionally emphasize that many legal questions require contextualization, as opposed to some discrete calculated answer. The result of the study implies that the LLM-based systems were capable of using what many of us here understand to be the "stochastic best-fit algorithmic generation" of a contemporary language model to adequately contextualize a student's question, providing insight into the trade-offs or complications implicit in the question, while then, critically, _meeting the professional standards of legal educators in explaining that complexity to a student_.
Realistically, I would hope this provides some confidence to readers of HN that they can actually ask a legal question to an LLM and expect the response will explain the complexity of the law in relation to the question. This is great news, and is likely the minimal pre-work any of us should do before actually consulting a lawyer, if time permits.
On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel. Possibly in the same way that a legal textbook does not replace legal counsel, or perhaps more accurately, the same way that stumbling upon a legal case study for approximately the same situation you're in doesn't guarantee you'll have the same result.
If you think about it and extract the semantics of any law, you get something that looks familiar, sort of like code. Of course there are some complexities where certain phrases can mean different things, but legal papers are in a way already written like programming languages, especially when it comes to law.
First we would have to define a language that can handle ambiguous operations, and we already have this with programmatic proofs where n should land in x. So in the end I'd assume it would look something like this in a two-party dispute:
This is a very simplified, pseudocode-like language; writing out a full contract would be as long as a real contract.
DEFINE DEFENDANT "A Corp"
DEFINE PLAINTIFF "B Corp"
DEFINE CONTRACT CONTRACT(PLAINTIFF, DEFENDANT, 3054-41-95)
// attaching extracted requirements, definitions and obligations of contract
FACT PLAINTIFF delivered(goods) ON 7054-34-99
FACT DEFENDANT paid(0) OF CONTRACT.amount
CLAIM breach WHEN obligation(DEFENDANT, "pay") IS NOT satisfied
PROVE breach:
    REQUIRE PLAINTIFF performed
    REQUIRE DEFENDANT.paid < CONTRACT.amount
    ASSERT delay WITHIN reasonable(time)
IF PROVE(breach):
    AWARD PLAINTIFF (CONTRACT.amount - DEFENDANT.paid) + interest()
ELSE:
    DISMISS
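For comparison, here is a minimal Python sketch of what the pseudo-DSL above might compile to. Everything in it is an assumption made for illustration: the `Contract` and `Facts` classes, the function names, the example amount of 1000.0, and the choice to treat `ASSERT delay WITHIN reasonable(time)` as a plain boolean precondition. The DSL's `interest()` is undefined, so interest is passed in as a parameter rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    plaintiff: str
    defendant: str
    amount: float  # hypothetical contract amount


@dataclass
class Facts:
    plaintiff_performed: bool  # FACT PLAINTIFF delivered(goods)
    defendant_paid: float      # FACT DEFENDANT paid(...)
    delay_reasonable: bool     # ASSERT delay WITHIN reasonable(time), assumed boolean


def prove_breach(contract: Contract, facts: Facts) -> bool:
    """All three conditions of the PROVE block must hold."""
    return (
        facts.plaintiff_performed
        and facts.defendant_paid < contract.amount
        and facts.delay_reasonable
    )


def decide(contract: Contract, facts: Facts, interest: float = 0.0) -> tuple[str, float]:
    """Mirror the IF/ELSE at the end of the DSL: award the shortfall plus interest, or dismiss."""
    if prove_breach(contract, facts):
        return ("AWARD", contract.amount - facts.defendant_paid + interest)
    return ("DISMISS", 0.0)


# Hypothetical numbers; the DSL example does not state the contract amount.
contract = Contract(plaintiff="B Corp", defendant="A Corp", amount=1000.0)
unpaid = Facts(plaintiff_performed=True, defendant_paid=0.0, delay_reasonable=True)
print(decide(contract, unpaid, interest=50.0))
```

The sketch also shows where the hard part sits: the booleans in `Facts` are exactly the contested, interpretive questions, so the code only formalizes what happens after someone has decided them.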
Then you would run a proof-based LLM to generate it into the target language, and since we already had an example of this from one of the AI labs, we know it works. Automatic citations and supporting proof would be automatically populated from reviewed legal -> DSL extracted papers as supporting evidence. I am sure that many AI labs are working on something similar already, and we will see something like that in the near future as proof-based LLMs evolve.
https://fortune.com/article/rise-in-elite-students-seeking-a...
and where they wanted to ban words such as "chief", "stupid", "karen" and "American"
https://reason.com/2022/12/21/stanford-elimination-harmful-l...
NotebookLM was considered slightly better than 2.5 Pro by the evaluators.
By the time any research study on AI is published, the models are already 0.5-1 generation ahead. Even this bullish outcome for AI models and their ability to perform useful work does not reflect how good they are now.
Given the number of responses the professors were asked to rate (200 each), they probably graded them the same way that bar exam responses are graded: quickly and superficially. Not surprising that LLMs achieved higher scores in this scenario, since they excel at producing superficially nice answers that don't hold up under scrutiny.
Also...unless statistics has changed in the past 2 decades, the math in the charts doesn't math. That's probably why they're leaving out the actual numerical data. I also wouldn't be surprised if we learn in the coming days that the charts were AI generated.
Recently, I tasked Opus 4.6 to study a new Czech building permit law in conjunction with some waste disposal regulations and the result was disappointing. The model could not stop drawing conclusions from obsolete regulations in its training dataset, even when given the fulltext of the new law. The usual "you are totally right" also applied and its conclusions were most of the time obviously wrong even to a human with cursory knowledge of the subject.
I ended with studying the relevant regulations myself over the weekend.
The inaccessibility of justice is a huge driver of inequality. Any tools which bridge this gap will help make a more just society.
Stanford and its donors of course want to replace anyone but its administrators, so they cheer on such anti-intellectual nonsense.
I think, in the right hands, this could be huge.
I'm getting more convinced. I mean, sure it makes dumb mistakes sometimes, but it's a particular set of self-serving mistakes, like commenting out tests in order to pass. We obv don't want this behavior, but I wouldn't say it's dumb.
It'll be like the Turing test, which we just blew past years ago and no one cared. After all the hand-wringing about sentience and rights of the AI if it passes the Turing test, and now we just have AI bots running 24/7 writing slop.
How does everyone else feel?
75% win rate seems pretty good!
Paper link: https://law.stanford.edu/wp-content/uploads/2026/06/salinas_...
THEN I find a human lawyer and give AI's answers to them and say "Can you find any errors in this? Can you improve it?"
That way I think my legal bills should be smaller because the AI has already done most of the work. What do you think? Which LLM is best for legal work?
But imagine if a dev team didn’t have to go engineer -> product manager -> legal team to get a question answered on local data retention requirements. You could ship that much faster.
However, the good news is that a whole bunch of lawyer positions in drafting docs and research will be able to be eliminated due to AI.
The authors point out that this other metric was computed in prior work and incorrectly dismiss it as being not as good as winning percentage in head to head competitions. The cited prior work shows that the models fare poorly on that metric. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5166938
In 'critical' industries, the error rate is massively important, and if the quality of search is reaching an acceptable error rate, that's quite big news.
He has access to employees and yes-men. What he actually needs to hear, nobody will tell him, AI even less so. Every shit idea he has, would be "what a bright idea"-ed by both everyone around him and AI.
And of course there's the little matter that he makes money and increases his power by selling AI. What seller doesn't promote their stuff as the greatest ever?
Humans have the advantage of perspective. We always lack some knowledge and answer broadly. This is bad if you have a particular goal in mind, but better if you're just generally learning, because you see more and learn to discriminate the correct from the wrong. And most importantly, being wrong is part of human ingenuity - because sometimes we turn something "obviously" wrong into something right.
Investor with vested interest in AI companies makes claim of reaching "AGI".
He is one of the last people to listen to about AGI. Unless the term "AGI" means something entirely different to him vs to independent researchers vs to CEOs, since the term has become entirely meaningless.
He stands to make billions if enough people believe him — unless you also do, consider that you’re the mark. For example, if that was true, it would have to mean that AI companies either aren’t letting customers use the good models or are instructing them to frequently make errors which reveal a fundamental lack of reasoning ability.
Consider also that his wealth means he hasn’t had to defend an idea stringently since the 90s. I wouldn’t be surprised if he does think LLMs give deep answers because it often looks that way until you critically review the response and ask questions like what’s missing which require you to have a decent understanding of the problem domain.
I also think it’s easy to think that AI gives good answers if you don’t know the field well. In fields where I know the material, the answers are pretty variable and can be quite bad.
The main results also don’t seem to know what a “model” is, as the two “models” it refers to are “stock Gemini 2.5 Pro” and “a retrieval-augmented version of NotebookLM”.
One of which is a model, and the other of which is an interface backed by different models depending on exactly when the analysis was performed.
In such a framing I don't find it surprising at all that teachers prefer the more polished answers generated by AI, because if LLMs are good at one thing, it is being confident in whatever they generate and present it convincingly.
There was another thread about the impact of AI on maths, and one of the arguments was about peer review... Made me wonder whether the writer was more concerned about the established order and gates being upset, or whether there's actually a valid technical criticism.
I do a second phase on Codex, asking it to download all the PDFs and extract the full text of the laws it references. That way I can repeat the research step fully locally.
Afterwards I ask Gemini to find issues and criticize.
UPDATE: there are many legal skills on GitHub to try; I have not used any yet
Please see attached contract we received from [counterparty]. ChatGPT says blah, blah and blah should be revised. What do you think? Is there anything else that we should change?
Model interpretability work has advanced a lot. Arguably we already can explain AI decision-making better than human brains.
There are however LLM context building techniques that anchor completions in data structures that persist the structure of claims that support the conclusion contained in a completion. Lots of different patterns exist —organizing logic in language is a rich domain— but the one I’ve liked the most is something called a Claim Dependency Graph that models the relationships between atomic claims as graph edges.
There’s a whole suite of operations you can perform on these structures, and “reconstruct how you came to this conclusion” is absolutely one of them.
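As a rough illustration of the idea, here is a minimal Python sketch of a claim graph and one "reconstruct how you came to this conclusion" operation. It is not taken from any particular library or paper: the `Claim` class, the `reconstruct` function, and the example claims are all hypothetical, and a real Claim Dependency Graph implementation may model edges quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    depends_on: list[str] = field(default_factory=list)  # ids of supporting claims


def reconstruct(claims: dict[str, Claim], conclusion_id: str) -> list[str]:
    """Return claim ids in dependency order, so each claim follows the claims it rests on."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(claim_id: str, path: tuple[str, ...]) -> None:
        if claim_id in path:
            # A claim that ultimately supports itself is circular reasoning.
            raise ValueError("circular support: " + " -> ".join(path + (claim_id,)))
        if claim_id in seen:
            return
        for dep in claims[claim_id].depends_on:
            visit(dep, path + (claim_id,))
        seen.add(claim_id)
        ordered.append(claim_id)

    visit(conclusion_id, ())
    return ordered


# Hypothetical atomic claims for a simple contract dispute.
claims = {
    "delivered": Claim("Plaintiff delivered the goods"),
    "performed": Claim("Plaintiff performed its obligations", ["delivered"]),
    "unpaid": Claim("Defendant paid less than the contract amount"),
    "breach": Claim("Defendant breached the contract", ["performed", "unpaid"]),
}

for claim_id in reconstruct(claims, "breach"):
    print(f"{claim_id}: {claims[claim_id].text}")
```

Because the structure lives outside the model, the chain it prints is the recorded support for the conclusion, not a story the model generates afterwards about its own reasoning.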
I don't think there will be any such market for "non ai" law. If I'm involved with the legal system I just want out as quick as possible as cheap as possible.
Asking the LLM in a way that makes it annotate its sources can greatly increase the pattern matching to closely simulate logic, just like in humans.
I understand the question of "why did you say this, not that?" I have seen other ways of asking it that do not seem to trigger the LLM's over-response in the other direction.
EDIT: just found out that Google is a major donor to HAI. So this research is at least partially funded by Google. Which is probably the reason the authors fail to declare no conflict of interest.
...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...
This all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.
Any lawyer who isn't using LLMs for research is behind the curve, though. They are unbelievable at finding niche cases you would never have found on your own. Previously it was a lot of exact search term matching, which is inherently useless for a lot of legal research. I need something that can search on vaguer terms, which AI can do incredibly well. Just check the results. I'm sure the LLMs from Lexis Nexis/Westlaw are probably better than the general purpose ones.
LLMs make fantastic paralegals. If you're doing any legal work, you should be using it, even if it's just to shoot ideas at. Have it play devil's advocate. My friend always has it play the other party's lawyer to see what all the counter-arguments are going to be.
Just like you would with software development. If you care about what you are creating, CHECK THE OUTPUT.
> As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student
Even saw some where they just slapped interviews + protocol into chatgpt as 'methodology' to extract the results -_-. Peer reviewed and published.
I liken it to me googling things as a sysadmin vs. Jane from accounting doing it. The non-tech end user is far more likely to make the problem worse, or install something sketchy from the ad riddled results than I am, or one of my help desk employees are.
I wouldn't trust myself to draft an important legal document using AI without the advice of a lawyer, much like I wouldn't really want to rely on my lawyer to use AI to write code for me.
One wrong advice clump and, like a step onto the wrong path while hiking, all subsequent steps go in the wrong direction. And sycophancy tuning means marginal one-sided takes get presented as sure-fire things.
I’m of the opinion that the big wins aren’t in using the LLMs to do the work (legal, in this case), but rather to refine and improve the dialog and presentation from all parties. A court-centric LLM could give a litigant the likely procedural needs, and a law-firm-centric LLM could help a pro se litigant create a meaningful and refined set of questions for lawyer consideration, condensed and targeted, saving all parties time and confusion while meeting the client’s linguistic needs ‘where they are’.
All the lawyers know things LLMs never will: the law is interpreted, and the written part isn’t engineering grade facts but suggestions interpreted in context. Arguably this is a racket and a thin veneer of plausible deniability for authoritarian rule. But as the law stands, even with federal statutes and citations from the court’s website, practicing lawyers will frequently end up explaining that in this county/country/court/jurisdiction The Way of Things is different.
The time lag between drafting and "deployment" also makes for much less effective, much more expensive debugging loops. You can deploy your code to prod in seconds, see an error pop up in the logs, and immediately start debugging. But it will take at a minimum days and frequently as long as several years before an error in a contract or a court filing will be detected, and often the error is beyond correction at that point. Thus, the errors are both more difficult to detect and to resolve.
And the consequences of error are often much greater, both because they are not correctable and because a legal error may risk someone's life, liberty, or substantial property. Although that's not categorically the case, obviously bugs in certain safety critical systems can be as bad or even worse than legal mistakes. But in general, most software is lower stakes than most legal writing.
On the flip side, LLMs do seem to do a better job with basic style and structure for legal documents compared to code. Things like following IRAC format, citing assertions of law (although hallucination remains an issue), and writing comprehensible sentences. These would be the equivalents in code to best practices like good comments, cohesion, consistent use of design patterns, test coverage, clear variable names, DRY, etc. Although the better performance on those more qualitative metrics may just be because even the longest legal documents are typically simpler in structure and have fewer lines of text than a large, complex codebase. Or maybe it's because LLMs are trained on natural language text more than on code. Or because natural language is more forgiving than code, in that minor variation in diction or grammar is unlikely to have any significant effect on how the document is interpreted, whereas even single character errors in code can have enormous effects.
Absolutely not harmless if you're the executor of an estate forced to deal with a screwed up AI will. I just handled my dad's estate this spring. It's a frustrating and confusing process even with the simplest of estates.
But is it a surprise law professors aren't great statisticians?
Such a document may not make a difference to the person that eventually will have died, but it can make or break the life of generations to come in countries that are so heavily optimized for dynasty building like the US.
I think it indicates that LLMs are smart enough to be used in the context of law education.
Believe it or not...
A lot can go wrong if you have real life human lawyers draft a legal document.
AI is not only replacing programmers, but art and the meaning of being human itself. It's showing us how trivial all of human creation is as it's just patterns from an algorithm.
He makes billions but he already is a billionaire. Gaining billions more doesn't mean shit. The guy really has nothing to lose and the utility of what he gains contribute little to his life style.
I will tell you this. HN has been comically wrong about everything related to AI. They said driverless cars have no chance of becoming useable. Now Tesla FSD is almost there and I sleep in waymo cars. HN said AI will never code, now everyone uses it to code.
It's fucking stupid. This is one of the smartest forums on the internet but HN becomes next to stupid when predicting AI. Why? Because humans can't face the truth. When the victim of attack is yourself, it doesn't matter how smart you are... you have to scaffold a rationalization to spare yourself as the victim. You have to lie to yourself and tell yourself that you matter.
The truth of it is, while LLMs are not the end game, AI in general is on a trajectory to take over. It shows us how meaningless our skills are... not only as programmers but as artists. That beautiful song you felt had greater meaning? It's all reproducible via an algorithm because it never really had a greater meaning. It was just a pattern.
The point is familiar but there are good illustrations in the Atlantic article by a book editor. At first it seems abstract AI hate, but then she gets to the details. AI text cannot be edited. https://www.theatlantic.com/technology/2026/05/how-to-tell-a... or https://archive.ph/YJsGK
Convincing a human law professor to click the "I would prefer to deliver this response to a student" button, and to not click the "this response is pedagogically harmful" button is a different task!
I could imagine an LLM convincing a typical human to click the "I like this one better" button with flattery, or with nice-sounding platitudes, or with hand-wavey explanations that sound plausible. And in fact that's exactly what LLMs do when they go wrong - they bluff and output superficially plausible nonsense!
But these weren't typical humans, these were law professors specifically tasked with deciding which response was a better option to give to students as a canonical answer to a contract law question. So I think this is a genuinely impressive result.
For testing, I've asked (admittedly last-gen) LLMs to generate legal opinions regarding issues in commercial English civil litigation, and I received back cases where the citation is real, but the area of law (family law) is not relevant as family courts apply a very different set of procedural rules.
(If you squint a bit, they sometimes might be relevant... and could be useful for a particularly creative litigator to make a novel argument on behalf of a very risk tolerant client. But you would very much want to go read those cases and think quite hard about them.)
One possible interpretation: the statements were very bland. These would be very low harm but also not very informative.
With that kind of logic ... anything is possible.
That isn’t even remotely what this study is looking at.
Another domain where LLMs are very effective at confidently leading people down a messy path. I have a roommate using LLMs to guide him through setting up some ollama stuff in my WSL (I happen to have the half-decent GPU here) and after multiple rounds of the bot trying to get him to do things that were redundant if not in the wrong direction entirely (and vaguely insulting as a matter of course), I had to write "ground truths" along these lines, and probably more as I find them:
We are using systemd. ~/.bashrc or similar dotfiles should not be used to start services/processes automatically. Do not "sudo" anything in ~/.bashrc.
[Yes, it did that] A systemd service should be created for any processes/services that need to run automatically and persistently. The current output of `systemctl list-unit-files | grep enabled` is available at [ . . . ]
sshd is already enabled + running and listening on 0.0.0.0:22 and [::]:22. ~/.ssh perms are already 700 and ~/.ssh/authorized_keys perms are already 600. Public key authentication is already enabled in sshd and ~/.ssh/authorized_keys already contains pubkeys ENDING as follows: . . .
tailscaled is already enabled + running; the tailscale address for [host] is [addr]
It is not necessary to fix connectivity to any 192.168.0.0/16 ; tailscale interface should be used for any traffic to [host] or other hosts involved in the project; hosts/nodes lacking tailscale interface should be assigned one
[roommate + bot spent 45 minutes on trying to configure their way through NAT when not having to do that is almost the entire point of tailscale. It was just (essentially) like, "You're absolutely right. We have tailscale set up, so we don't need to be able to ssh to that other interface at all. Not troubleshooting that would have saved 45 whole minutes. Oh well, now what?"]
Maybe it's just me, but I'm not inclined to trust the judgment of something that can't keep this kind of thing straight, which I know is to some degree a matter of having all the needed info in the context window. But maybe it would be able to do that if it didn't waste tokens telling me to cd into the same directory that I'm already in every 2 minutes, or chmod .ssh/ again, or (when it really needs to burn some tokens) blow away the .venv and pull a bunch of modules again just to "start clean".
A groundbreaking study led by Stanford Law School Professor Julian Nyarko reveals that law professors overwhelmingly prefer AI-generated answers to student questions over responses written by their fellow instructors—a finding that could reshape how legal education is delivered.
The study, titled “Law Professors Prefer AI Over Peer Answers,” was conducted with 16 law professors across U.S. law schools and tested whether large language models could serve as effective tutors for contract law courses. In a blind evaluation of nearly 3,000 anonymized comparisons, professors rated AI responses significantly higher than answers written by other professors, with AI winning 75% of head-to-head matchups.
“This study challenges important assumptions about AI’s role in legal education,” said Nyarko, who leads Stanford Law School’s Legal Innovation through Frontier Technology Lab, or liftlab. He co-authored the paper with colleagues from Yale, NYU, University of Chicago, and other leading institutions. “We focused on law precisely because it requires judgment, nuanced reasoning, and the ability to navigate ambiguity—not just factual recall.”
The study is particularly notable because previous AI evaluations have focused primarily on subjects with clear right-or-wrong answers. Legal reasoning, by contrast, demands careful analysis of competing arguments and defensible conclusions.
“We were frankly surprised by the magnitude of the results,” Nyarko added. “These weren’t just simple questions with obvious answers. Many of them required synthesizing complex material, applying it to new situations, and explaining legal concepts in ways that would help students develop their own analytical skills.”
Participants created 40 representative contracts law questions that students might ask after class or during office hours, wrote their own answers, and then evaluated responses without knowing whether they came from AI or other participating professors. The AI systems performed comparably to the best human instructor in the study.
Perhaps most striking: professors flagged AI responses as pedagogically harmful only 3.5% of the time, compared to 12% for peer-written answers.
“In most fields where AI gets tested, there’s a right answer. In law, there often isn’t,” said Sarath Sanga, co-author and professor at Yale Law School. “Two opposing arguments can both be good. What we wanted to know is whether AI can meet the latent professional standard that lawyers use to evaluate each other’s arguments. In this case, the answer was yes.”
The research team took extensive precautions to ensure the study’s validity. They calibrated AI responses to match the length and structure of human answers, used multiple evaluation methods, and had professors assess whether responses might mislead or confuse students.
“We designed this study to be as rigorous as possible because the stakes are so high,” Nyarko explained. “Legal education is about training future lawyers to think critically, argue persuasively, and navigate ethical complexities. Our study makes important steps towards finding out whether AI could support that mission.”
Alejandro Salinas, first author of the study and a researcher at Nyarko’s liftlab, emphasized the educational implications: “Our study shifts attention to what AI tutoring can contribute to learning in judgment-rich fields like law. We find that, when evaluated by legal educators, AI tutors can offer high-quality, on-demand support that complements classroom instruction, and may broaden access to expert guidance.”

The study also examined specific AI models, including commercial tutoring systems and Google’s NotebookLM, finding varying levels of performance. However, even when context limitations affected AI responses, professors still frequently preferred them to human-written alternatives.
The findings arrive as law schools nationwide grapple with integrating AI tools into legal education while maintaining rigorous academic standards. Some institutions have embraced AI experimentation, while others remain cautious about potential risks including hallucinations, overreliance, and the erosion of critical thinking skills.
“Our study evaluates the quality of answers given by AI tools. But how to implement these tools to most effectively improve student learning is still an open question. So we’re not advocating for wholesale adoption of AI tutors,” Nyarko cautioned. “But our data suggests that blanket skepticism may be equally unwarranted. The conversation should shift from whether AI can give accurate, high quality responses to how we can deploy it responsibly to the benefit of our students.”
Liftlab is among the first academic efforts in legal AI to unite research, prototyping, and real-time collaboration with industry. Its mission is to increase access to high quality legal services in the private sector by leveraging AI and other frontier technologies. To bridge the gap between theory and practice, liftlab’s work extends beyond conceptualization and encompasses the building of prototypes that help explore the utility of AI-based solutions.
Stanford Law School is one of the world’s leading institutions for legal scholarship and education. Its alumni are among the most influential decision makers in law, politics, business, and high technology. Faculty members argue before the Supreme Court, testify before Congress, produce outstanding legal scholarship and empirical analysis, and contribute regularly to the nation’s press as legal and policy experts. Stanford Law School has established a model for legal education that provides rigorous interdisciplinary training, hands-on experience, global perspective and a focus on public service.
so extrapolating from that, in another two years it will continue to bamboozle
If you have 100 responses from 1 professor, and the AI wins 75% of the time that is very likely a true signal that the AI is better than this prof. It would be incorrect to generalize this to all profs though.
Further, if you sample 16 profs and the AI beats 10 of them you can be fairly certain that the real percentage of profs it beats isn't 10%. Further, when estimating the probability that the AI beats a random prof, it's the relative estimation error that scales with 1/sqrt N. If you have a coin and it lands heads up 16 times, that tells you something quite robust about the coin.
Reasonably estimating confidence intervals at small N and high p is not trivial. But it can be done.
A good heuristic is "add 2 successes and 2 failures", which is due to Agresti & Coull.
See down the page here for source papers:
https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
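The "add 2 successes and 2 failures" heuristic above can be sketched in a few lines of Python. This is only an illustration and is not taken from the study: the function name `plus_four_interval` is made up here, and z = 1.96 assumes a 95% interval.

```python
import math

def plus_four_interval(successes, trials, z=1.96):
    """Approximate CI for a binomial proportion via the 'plus four' rule.

    Adds 2 successes and 2 failures, then applies the usual
    normal-approximation formula (a simplified Agresti-Coull interval).
    """
    n_adj = trials + 4                  # 2 extra successes + 2 extra failures
    p_adj = (successes + 2) / n_adj     # adjusted point estimate
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to [0, 1], since a proportion cannot fall outside that range.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# The "AI beats 10 of 16 profs" case from the comment above.
low, high = plus_four_interval(10, 16)
print(f"{low:.2f} to {high:.2f}")
```

For 10 wins out of 16 this gives roughly 0.39 to 0.81: a wide interval, but one that sits well clear of 10%, which is the point being made.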
If the only purpose of asking a lawyer is transferring risk (aka cover your ass) while getting the same advice as an LLM, that’s slowing down delivery for purely bureaucratic reasons.
I’ve seen that mentality at big companies where everyone is scared to stick their neck out and be accountable for a decision. And nothing gets done. Drives me crazy.
But the people who move up are the people who take ownership and get shit done (and are right a lot).
(BTW, I have been at companies that were sued by regulators. They never really punish the individual(s) who were in the room when the decision was made. So your worry is kind of misplaced.)
You don’t become a billionaire because you aren’t committed to making a number go up far after you no longer have any significant unmet needs. He’s spending his life focused on business deals because that’s what he cares most about — if his true love was science, philanthropy, etc. he’d have been able to do that full time a couple decades ago.
Those services were usually just based on NLP + simple decision trees, and people actually won their cases.
Of course, doing huge corporate contract disputes, IP disputes, M&A, and whatever will probably be out of question for a good while. Same with more serious criminal cases where the stakes are very high.
But I think there's potential for automating away less serious cases, especially where there's good structure.
And of course, it all depends on what kind of legal system one is situated in. Immediately I'd think that Civil Law would be easier for AI lawyers, as its inherent structure is a better fit for machine reasoning. So I'd expect to see more AI products start in Civil Law countries.
The fact that Lexis and WestLaw have such an iron grip on the entirety of the US legal system is exactly why general LLMs are completely unequipped to be useful in this domain.
So yes, we can say the LLM created bad code when it does not compile or fails prewritten tests.
But experts might disagree what good comments, good cohesion, appropriate use of design patterns, appropriate test coverage or clear variable names are.
So what are we supposed to train the LLMs towards? Somebody still has to decide what "good" is.
It seems to me like it would be more difficult to achieve with legal documents and, in my experience at least, writing a concrete plan has been the decisive factor that makes my AI coding robust (plus all that you mentioned).
And in my experience if you do actually pay a lawyer for something they will act like you're not worth their time and will literally roll their eyes at you when you're trying to explain the minor details of a case because they are too lazy to listen and zone in like I would when doing my job.
It's a bit like with doctors, you'll want a second opinion, if you can afford it.
Naive question from an outsider: aren't there searchable databases of cases (with complete text) so that citations could be checked automatically, either by the same or an independent agent?
The point is that if the study can't validate the claims being made then we can't actually extrapolate from that claim. What you're predicting may or may not come true, but the study (which is the topic at hand) isn't useful for supporting the assertion.
The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation.
I've seen this happen multiple times now. Accountants and legal professionals advising clients based on outdated information assembled through ChatGPT, Claude and Copilot.
Professionals drafting letters and missing recent case-law which handles their exact case. It's unreliable.
So it can save you some work; but it can't save you all of the work. And in some cases its mistakes really force you to redo all the work, and more, to be thorough and have confidence in the result.
The "biggest problem" being the one thing that is trivial to verify against concrete databases is a bit convenient don't you think?
I think it's more likely that it makes mistakes evenly but the one thing that you are able to check with certainty is the only place you discover the errors.
IF the right questions are asked, and IF steered into and corrected at a few crucial points. IF not it goes off in the wrong direction really quick and that's a problem that's still mostly unsolved in the last 2 years.
And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.
I help run a few marketing websites where I let the CEOs run crazy with Claude cowork; they are making PRs like madmen, but they are not allowed to touch any of the APIs & platforms where there is real user data & sensitive information.
There’s also the fact that they can’t possibly keep improving frontier models at the same rate (i.e. training investment) when investment starts slowing down. The amount of cash being burned is completely unsustainable and you’re already seeing some pullback.
> I think this is probably true for most skilled professions.
I agree, BUT I also find that it's easy for experts to atrophy quickly. When the AI is right 80/90% of the time it lulls you into overconfidence. I find those that are best and make the greatest use are the ones who remain skeptical but also use the tool. The same people who were already nuanced and picky before AI. The same people who already doubted and questioned their own work, and used that suspicion to help prevent them from having overconfidence in their own work. If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
(To be clear, I'm not saying perfectionists. Some might call them that because the picky people have higher standards, but a good expert has to also understand that perfection doesn't exist. That's often a driving force in the suspicion! This also tends to cause them to continually improve)
Not saying we should take such studies as the "gospel truth" ... but if you ignore them and only consider "proper" studies, you'll be waiting a very long time to learn anything new.
i think devs overestimate their own role and underestimate others
i am seeing lawyers and doctors roll out their own software with AI
but we dont have their training and experience
I would imagine it's similar in law, in that it takes a lawyer or judge to know where the foot guns lie.
Yet that is exactly what a lot of C-Suiters (many of whom are lawyers), are doing.
you can get away with anything
Probably for important deals, detailed human review will be expected.
Maybe the real value-add will be the insertion of language that LLMs won't be able to figure out, but which will be favorable for the side that inserted them.
Even the good ones will not step above and beyond what they are paid to do
but an AI ? it will and can go above and beyond
The legal system is structurally based around manipulating text and its relations. It seems to me that the entire legal industry is the ideal use case for LLMs to take over.
Of course the legal system can gatekeep forever by design.
A bit of extrapolation from the study, but not a crazy stretch.
[1] https://www.legifrance.gouv.fr/
[2] https://legal.thomsonreuters.com/en/westlaw/plans-and-pricin...
But they can perform live websearches or go directly to a DB specified.
It's not like self driving cars where better than a human 80% of the time isn't good enough and they aren't really usable until it's 95%, 99% etc.
But it might be that the optimization target itself has a ceiling. If you're training toward human approval ratings from a broad population, you converge toward what median preference selects for. The plateau is baked into what you're measuring against.
Thinking the AI is right 80/90% of the time is already a sign of being lulled into overconfidence. The actual percentage is much lower in my experience. I'm willing to grant the AI is "somewhat right" that often but is that really what we settle for?
Am I secretly the only person who ever actually cared about being very accurate? Is AI just an excuse everyone else is using so they can stop pretending? This is so incredibly frustrating
> If you weren't willing to just "lgtm" with your own code, it's difficult to do that with AI.
If you are willing to do that with your own code you should probably not be trusted to work on software
I just wish people would take a step back and think about the timescales here. Language Models are Unsupervised Multitask Learners was in 2019. Here we are seven years later and LOOK AROUND. The landscape is unrecognizable. It's worth thinking about who, in those seven years, had an accurate estimate of the future and whose estimate fundamentally failed. And just as it is valuable to note where propaganda about progress speeds past where we are, we should remember that it is costless to announce that at some unspecified future time all of this will settle down and things will go back to the way they were.
The danger of those mistakes creeping in also grows exponentially the farther a lawyer strays from their core legal expertise. There are a few statutes I know inside and out, and I can spot LLM analytical errors related to them in a split second, but once I venture out into domains where I am not an expert (but where I am nevertheless reasonably qualified to practice), it becomes much harder to spot drafting mistakes because I have not refreshed my own understanding of the law by reviewing the relevant cases or statutes as I would when drafting the analysis myself from scratch.
We have to settle for 'crumbs'?
Why would you say this like it is true?
Other than AI companies, a more realistic option is state funded universities (particularly in Europe and east Asia, where states have consumer protection agencies whose purpose is to protect their residents from corporate greed), which should fund, commission, or even conduct such studies. They also have enough money to do this properly.
If there is enough money for propaganda, there should also be enough money for the truth.
In my opinion, the main thing we need to do is have training happen continuously. And probably more real world data (from sensors).
Which also happens with humans – does it do so at a lower rate? On its own, it kind of sounds like similar anti-self-driving-car arguments.
Every new model might not be a leap like it used to be, but give it enough time and improvements add up.
One thing I learned, just bite the bullet and re-write the whole fucking will instead of making riders.
Piecing the will together from riders was terrible. All the clauses fell away as everyone got older. The final will could have been 8 pretty clear pages.
The other part that is hard is just knowing all of the things that happen with assets and a passing. Luckily we had another lawyer and financial folks to advise us. It was still a lot and not that easy to find details. This was pre-AI, which would have helped walk through this shit.
e.g., https://www.npr.org/2026/04/03/nx-s1-5761454/penalties-stack...
can't get more foot gun than "well according to [fiction] it is a well established practice (that the defendant is guilty)"
For murder that's not such a huge deal because the statutes are typically easy to track down and don't really differ all that much substantively, but once you get really into the weeds on something like commercial contracts it can be a huge pain to do cross-jurisdictional research.
And that's just a tiny, super obvious example of how impenetrable statutory law is, which isn't even the really pernicious problem. Case law is infinitely worse. It makes me absolutely furious how difficult legal research still is. The Westlaw/LexisNexis duopoly is a moral crime and wildly destructive to the quality of government in this country. Every single written court opinion should be publicly available for free on the internet in an easily searched format. It would cost practically nothing to achieve. We're talking about less text than Wikipedia hosts. Yet still many states make it almost impossible to access case law. Even though these cases are law. Binding law that we are supposed to follow, yet we cannot even easily access. It's insane, and largely perpetuated by the complacency of lawyers who can charge others for what should be free, the lobbying of the duopoly, and the incompetence of politicians.
If all of the laws were consistently available and stored in reasonable, consistent citation formats (I would settle for hyperlinking as a replacement for the rat's nest of wildly varying jurisdiction-specific citation systems), it would even be possible to introduce a form of unit testing for legal drafting that would allow us to automatically verify if the LLM hallucinated a citation.
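The "unit testing for legal drafting" idea above could look something like the following sketch. Everything here is an assumption for illustration: the regex is a toy that recognizes only a few reporter formats, `KNOWN` stands in for a real, consistently formatted case-law database (which is exactly what the comment says is missing), and the second citation in the sample draft is deliberately made up.

```python
import re

# Toy pattern: "<volume> <reporter> <page>" for a handful of reporters only.
CITATION_RE = re.compile(r"\b(\d{1,4}) (U\.S\.|F\.2d|F\.3d) (\d{1,5})\b")

def unverified_citations(draft, known_citations):
    """Return citations found in the draft that are absent from the known set."""
    found = [" ".join(m.groups()) for m in CITATION_RE.finditer(draft)]
    return [c for c in found if c not in known_citations]

# Hypothetical stand-in for a lookup against a public case-law database.
KNOWN = {"347 U.S. 483"}

draft = ("See Brown v. Board of Education, 347 U.S. 483 (1954); "
         "but see Smith v. Jones, 999 U.S. 999 (2031).")  # second cite is invented

missing = unverified_citations(draft, KNOWN)
print(missing)
```

A check like this only confirms that a citation exists; it says nothing about whether the cited case actually supports the proposition it is cited for.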
It also doesn't help that we (for what were at the time very good reasons) moved away from the system of legal writs that used to provide fairly standardized, almost "cut and paste" templates for legal filings. So now every legal document (filings, memos, contracts, court opinions, statutes) is drafted like a bespoke, artisanal creation with few strict structural or stylistic conventions. That makes automated interpretation much harder than it needs to be.
There's been a lot of news stories about lawyers using AI, and then getting in trouble for citing hallucinated laws or cases. It doesn't matter if the AI response is "preferred" over the human one if it gets thrown out when put under the scrutiny of a real case.
Make sure to use a deterministic pipeline or harness to go step by step so agents aren't checking their own work and I sometimes get alpha from having a codex check the work of a clod but I am seeing pretty good output across multiple domains when I have three independent quality gates and a loop which only spits it out to a human if it doesn't converge at a reasonable cost.
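A rough sketch of the kind of harness described above, under stated assumptions: the names (`run_pipeline`, `Outcome`, the three toy gates) are hypothetical, the producer is a stub standing in for a model call, and a fixed `max_rounds` stands in for "a reasonable cost".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A gate inspects a draft and returns None on pass, or a complaint string.
Gate = Callable[[str], Optional[str]]

@dataclass
class Outcome:
    draft: str
    converged: bool   # True only if every gate passed
    rounds_used: int

def run_pipeline(produce: Callable[[List[str]], str],
                 gates: List[Gate],
                 max_rounds: int = 3) -> Outcome:
    """Plain loop: the producer never grades its own work; the gates do."""
    feedback: List[str] = []
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = produce(feedback)
        feedback = []
        for gate in gates:
            complaint = gate(draft)
            if complaint is not None:
                feedback.append(complaint)
        if not feedback:
            return Outcome(draft, True, round_no)
    # Did not converge within budget: hand off to a human reviewer.
    return Outcome(draft, False, max_rounds)

# Stub producer and three independent toy gates, for illustration only.
def stub_producer(feedback: List[str]) -> str:
    return "short" if not feedback else "A longer draft that cites [1]."

gates: List[Gate] = [
    lambda d: None if len(d) >= 10 else "too short",
    lambda d: None if "[1]" in d else "missing citation",
    lambda d: None if "TODO" not in d else "unfinished TODO",
]

result = run_pipeline(stub_producer, gates)
stuck = run_pipeline(lambda fb: "short", gates, max_rounds=3)
```

In practice each gate would be a separate checker (a different model, a test suite, a linter), and only the `converged=False` outcomes would reach a person.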
IDK "not any of it" seems a bit strong, especially thinking towards 2028. For a lot of knowledge professions, there is a surprising amount of tasks that are just dumb work compared to the rest.
But it depends on the skill:
- For landing pages & simple saas solutions: marketeers & founders have more skill, since they understand the user best. The real skill is not the basic coding, but understanding the market.
- For security risks/architecture: senior devs can spot things in seconds
I'm not a doctor or lawyer, but I'm sure there are cases where AI is really good in a similar way and cases where they miss the most crucial aspects.
I agree that you can create a set of domain specific rules, reinforcement layer validation tools, like self driving, that vastly improves the accuracy of AI & LLMs, making humans less and less needed. But where the magic of LLMs comes from generic knowledge, this will be the opposite, narrowing it down.
Out of curiosity, why would you love to be wrong about that? What possible outcome could you see being a net positive for society if the vast majority of knowledge workers (and ultimately, as robotics progress, most workers in general) are replaced by AI?
For example, my sister is a translator and she says that checking AI translations is actually harder in many ways than doing a translation in the first place, but the agencies pay less for checking than actual translation.
I mean that's what is wanted by some companies.
The problem, especially for things like legal, is that it requires someone more skilled to read through and understand whether the argument is bollocks, or whether the law/precedent they are banking on is in fact the right one.
We have a tool that auto-writes letters to our management companies when they break SLAs. We have a slider that goes from polite to we are going to extract your first born.
That's simple-ish to do for LLMs, and low risk.
Drafting contracts is also something we could probably do, as it's mostly boilerplate. However, the consequence of mis-drafting a contract is millions of dollars.
Context is still a large limiting factor, and we have band aids around that area already. And the further along we go the further distributed LLMs get in terms of additional pieces.
As for the original article and sentiment I'm sure AI will be a boon for law. It's going to be much easier for the general consumer / person / small business to represent themselves which feels like a win. The downside is I feel like we're tracking towards a digital hell of "virtual lawyers" that will be at the whim of any org. Consumer laws really need to change now to help avoid this dystopian path we're on.
The further we get into this, the more AI feels like 3-D printing. Significantly bigger and will be more widely used for sure. But nowhere near the “new industrial revolution” that all these companies are making it out to be
Mixing them, is, not, in my experience, OK. In the future, I am sure that LLMs will reach the point, where their output will be beyond reproach, but we're not there, yet.
That means that someone that knows the context and content, needs to vet the output, before sending it on.
I think on the contrary, LLM providers accumulate huge logs of interaction with their users, which elicit that tacit knowledge and mine it and humans cooperate willingly in order to solve their tasks. Just imagine the corpus of sessions for scientific research, education or software development, it is probably the largest such collection ever to exist. Trillions of HITL tokens per day flow into those logs, carrying our perspectives, choices, original ideas and tacit knowledge. I call this the "human-AI experience flywheel". It's the new stackoverflow, next model generation is based on interaction data from previous one.
My favorite example of this is knowing how to untangle a big pile of cables. There are robots now which can untie a single knotted cable, but I don't think any can do a pile of cables yet. https://www.youtube.com/watch?v=vp-94rsherE
Median household net worth is in fact somewhere in the $100k-200k range, which is definitely something that could be meaningfully called an "estate." (Most of this tends to be the house, the median net equity in which is about $190k as of 2022).
Source: https://www2.census.gov/library/publications/2024/demo/p70br...
[1] This doesn't mean "homeowners," rather it's a recognition that assets for married or cohabitating couples are usually commingled.
But I could also see a world where that, too, is fed to models for hyper-local results.
Could be a way off, but I could see it.
An "estate" is a legal term for property, assets, and liabilities a person leaves behind upon their death. A family member is a top practitioner in the field of estate planning and resolution, and some of the messiest estates they have handled are pro-bono cases of exactly the type of people you would put in italicized "most people": poor, not really able to upkeep a house they inherited from a relative which hadn't had title properly transferred on a previous death because they didn't have money for an attny, now can't get a loan to fix the roof...
Yeah, if you are homeless, carless, and have only the clothes on your back and a shopping cart of stuff, you don't have an estate. Everyone in the middle class in the US has an estate. Much of the time it passes automatically to their spouse on death, but it's still an estate.
And if you are concerned about where it goes, get a GOOD attny. There are many bad ones hanging out their shingle as "Trust & Estate" attnys, and some of the next messiest cases are fixing problems made by those not-so-good attnys.
And NO, AI is not good enough.
I have no doubt that you're right, but will it be because they are close to infallible or because we have let ourselves become lazy and reliant?
My money is on lazy and reliant based on the trends I'm actually seeing
I get that you might have a 'UBI/alternative general welfare is impossible' up your sleeve, but you've written this like it's somehow unfathomable that not forcing everybody to work just to survive would be a good thing. Of course it would be good! It's just a matter of dealing with the (huge) side effect of lost income.
Believe it or not, some people actually do enjoy their jobs and work they do.
> I get that you might have a 'UBI/alternative general welfare is impossible' up your sleeve, but you've written this like it's somehow unfathomable that not forcing everybody to work just to survive would be a good thing.
UBI absolutely is unfathomable here (US). The USG won't even give people health care. People go bankrupt to afford life saving care on a regular basis. Or just die... Even if those cases are a minority, just the fact that it happens says a lot. So I do think it is unfathomable that UBI would be implemented here. I don't think that's unreasonable to say.
It is not hard for me to imagine a world where if my bosses didn't need me, they would prefer me to be dead than to pay me some kind of permanent income. They would prefer to keep that power to themselves
These are already the sort of people who will happily lay you off into a recession, leave you without a way to pay your rent or for food if it improves their bottom line. They do not care if you starve. Or at least they care less than they do about their quarterly bonus
So no, I don't trust these fucks to continue playing nice if they view my value as going to zero
Somewhat related to that -- I was just this weekend watching a YouTube essay about PTSD in knights back in the medieval times, and the main point made in the video is that the psychological impacts incurred by the knights after battle were not just from seeing fucked up shit... the most apparent and serious cases of "PTSD" occurred when a knight was injured enough on the battle field resulting in them no longer able to be soldiers. Their entire purpose in the world got stripped away resulting in serious psychological stress. I think that same issue would apply to many people today (lawyers, engineers, investment bankers, etc) who would no longer be able to practice their craft. (This is the video for reference, was a good watch https://www.youtube.com/watch?v=849dmdc-Qf8)
I understand the counter argument to this is going to be some anti-capitalist rhetoric like "Well people shouldn't live to be workers and that's fucked up that they have to live that way!" but IMO, some people like what they do and don't want to be made useless. (Not implying that is what you were insinuating, but just in a broad sense that genre of argument doesn't make sense to me)
Ultimately they are clearly here to stay but I think they are going to be incredibly important in some industries and minimally present in others (a glorified chatbot/summarizing tool for instance). Whatever form it takes it’s definitely not going to be a model where individuals have subscriptions they pay for monthly.
exactly my point to compare it with pre-iPhone mobile market: wide (and growing fast!) adoption, clear potential (WAP websites, J2ME games), many players in the game, some real market fit discovered already (Blackberry), influx of capital and tinkerers alike, but still a lot of unknowns where it will ultimately land.
Even if no single improvement was revolutionary (even first iPhone was just a fancy phone without App Store), overall mobile made billion dollar industries possible, for better or worse, and changed the way we live. Counts as industrial revolution, comparable to the Internet itself in my eyes.