they are definitely useful but they miss the things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, team preferences (risk tolerance, etc)
and those factors are really important. which means that test-evals should be relied upon more as weak/directional priors than as definitive measures of real-world usefulness
This transformation will rule out confidence ranges with negative time.
BTW, a log-normal distribution tends to produce events x > E[X] + d more frequently than events x < E[X] − d (at least for large d, since the left tail is cut off at zero while the right tail is unbounded). If one needs reasons why software projects are often late, this is one of them.
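A quick simulation makes the asymmetry concrete. The mu/sigma values here are arbitrary, chosen only so the median task duration is about 10 days:

```python
import math
import random

random.seed(0)

# Hypothetical schedule model: task duration ~ LogNormal(mu, sigma).
mu, sigma = math.log(10), 0.6          # median duration ~10 days
mean = math.exp(mu + sigma ** 2 / 2)   # E[X] of a log-normal, ~12 days

samples = [random.lognormvariate(mu, sigma) for _ in range(100_000)]

d = 9.0  # a large deviation, comparable to the mean itself
over = sum(x > mean + d for x in samples) / len(samples)
under = sum(x < mean - d for x in samples) / len(samples)

# Big overruns are far more likely than equally big underruns.
print(f"P(X > E[X] + {d}) ~ {over:.3f}")
print(f"P(X < E[X] - {d}) ~ {under:.3f}")
```

Note that for small d the comparison can flip, since the bulk of the mass sits below the mean; the schedule-risk story is about large deviations.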
I've been thinking about tools for organizing long AI conversations.
Scrolling through hundreds of messages quickly becomes painful. I'm curious how people here manage long AI chats.
I've been thinking a lot about tools for organizing long AI conversations. Curious how people here currently manage them.
It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow was just... weird. It was hard to follow and I was seriously bugged by it.
So I asked it to reimplement it with some simplifications I gave it. It dutifully executed, producing a result >600 lines long. The flow was simpler and easier to follow, but still seemed excessive for the task at hand.
So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand.
So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept.
This is compounded by the widespread assumption that automatic programming produces the same results regardless of who uses it. That is only true of benchmarks, basically. Benchmarks are useful metrics because even weak guidance is better than none, but the real-world dynamic today is that AI completely changes what it is capable of depending on the programmer driving it.
Maybe never in the history of programming has there been a time when diverse programming skills were as important as they are today (though this may change as AI evolves).
Interestingly, I had a similar finding: on the 3 open-source repos I ran evals on, the models (5.1-codex-mini, 5.3-codex, 5.4) all had relatively similar test scores, but on other metrics, such as code quality or equivalence to the original PR the task was based on, they showed massive differences. I posted results here if anyone is curious: https://www.stet.sh/leaderboard
But hey, the tests pass!
If I force it to use plan mode for everything and babysit it, it can work really well, but it's really just acting as a faster typer for me, which is great. But it requires an experienced dev steering it.
The simplest reasonable model would be logistic regression. It's also got 2 parameters and the range is correct.
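A minimal sketch of that idea, fit by gradient descent. The (year, pass-rate) points below are invented purely for illustration, not real benchmark data:

```python
import math

# Invented (year, pass-rate) points standing in for benchmark scores over time.
data = [(2023.0, 0.15), (2023.5, 0.25), (2024.0, 0.40),
        (2024.5, 0.55), (2025.0, 0.65)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two parameters (a, b); predictions stay in (0, 1) by construction,
# unlike a straight line, which happily extrapolates past 100%.
a, b, t0 = 0.0, 0.0, 2024.0  # center time so gradients are well-scaled
lr = 0.1
for _ in range(10_000):
    ga = gb = 0.0
    for t, y in data:
        x = t - t0
        p = sigmoid(a + b * x)
        g = (p - y) * p * (1 - p)  # d(squared error)/d(logit)
        ga += g
        gb += g * x
    a -= lr * ga
    b -= lr * gb

for t, y in data:
    print(t, y, round(sigmoid(a + b * (t - t0)), 3))
```

The bounded range is the whole point: a linear trend fit to pass rates will eventually predict impossible scores, while the logistic saturates.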
Also, some maintainers will outright say they reject any AI code, but most employ silent-treatment tactics. And then when you demand that they review, they either close the PR or offer "I'm too busy" as an argument. I would call this one of the biggest dick moves, because it hurts the most, yet you can't find anything wrong with them until they reveal their motives.
For me the big takeaway is that passing doesn't automatically mean the code is maintainable, follows established patterns and conventions, or is free of unexpected side effects that real reviewers care about.
is_done = False
while not is_done:
    if pattern1:
        ...
    if pattern2:
        ...
        if matched == "SUCCESS":
            is_done = True
            break
    if pattern3:
        ...
It's usually correct but extremely hard to follow, reminiscent of good old asm code with convoluted gotos. And colleagues tend to do reviews with the help of agents, so they don't even bother to read this mess.
Also had a similar experience in the past weeks reviewing PRs written with LLMs by other engineers in languages they don't know well, one in rust and one in bash. Both required a lot of rounds of revision and a couple of pairing sessions to get to a point where we got rid of the extraneous bits and made it read normally. I'm glad the tool gave these engineers the confidence to work in areas they wouldn't normally have felt comfortable contributing to, but man do I hate the code that it writes.
This doesn't always work as well as I'd like, but largely does enough. Conversely, doing as I go has been a waste of time.
Now some people argue that terrible code is fine nowadays, because humans won't read it anymore...
Isn't it precisely what this article is questioning?
I don’t think that’s a fair characterization. You don’t know if the maintainer/reviewer is overloaded. No one is obligated to accept/review PRs and there is no question that the amount of noise has gone up. You are not the main character in that story, so to speak.
If you can't write a description in your own words explaining why you're doing it, why should they take the time reviewing it (which they did on the same day you posted it, btw, even if one of them wasn't pleased)? It makes it seem much less likely that you read the code yourself.
You might want to think carefully about why you chose to use the word "demand" there.
(Personally, if I'm rejecting AI slop, I'm not going to do it silently. But there are any number of valid reasons to not jump on someone's PR to review it.)
It has a third-year college student's approach to "make it work". It can't take a step back and reevaluate the situation, or determine a new path forward; it just hammers away endlessly at whatever it's trying until it can technically be called "correct".
The days of the deep expert, who knew the codebase inside out and had it contained in their head, are coming to an end.
A big-bang approach could be a start, but lots of one-line guidance about specific things you don't want to see stacks up real fast.
> We’re heading for a world of terrible code that can only be maintained by extremely good coding agents and are pretty much impossible for a human to really understand.
I once figured out the algorithm of a program written in a one-instruction ISA; I think the instruction was three-address subtraction. In my opinion, you overestimate the ability of coding agents to, well, code, and underestimate the ability of humans to really understand code.
The chart in the article under discussion appears to plateau if one excludes the sample from 2024-07. So we are not quite heading there; we are plateauing, if I may.
There is no thinking, no matter what marketing tells you.
That's something I've argued here several times, and it's actually rarely examined: it's totally different when a non-developer uses such a tool for programming versus when a (senior) SWE does. That's a fundamental point, and IMHO the potential is for (non-risk-free) augmentation rather than replacement. Replacement makes for an excellent narrative (if not scapegoat), yet if the tool is only "productive" (with KPIs to be agreed on) in the hands of skilled staff, then replacement isn't reality, just a "wish".
For the most part, I think the tests AIs have been given have been appropriately designed. At release, many AIs do poorly on them; then the models rapidly catch up, until the point where a new test is needed.
They should be measuring close to the limits of ability like that.
There will be some that try to steal headlines by targeting the specific nature of a test, but that is not a long-term winning strategy; the tests keep getting harder. And if they make a model good at every test it has seen, without regression, then with enough tests that too ceases to be a problem.
Perhaps there should be an aggregate AI test score that evaluates all of the tests released in a given year. If a model passes the latest test really well but does worse at TestSet2024 than the models before, it would perhaps indicate the model being trained to pass the latest cool test.
There is a problem with people interpreting an AI that passes a test of X, Y, or Z as having the abilities of a human who passes X, Y, or Z. Tell people who say that that Kasparov makes a nice coffee.
You can also measure the crossentropy, which is essentially the whole program entropy above minus entropy of the programming language and functions from standard libraries (i.e. abstractions that you assume are generally known). This is useful to evaluate the conformance to "standard" abstractions.
There is also a way to measure a "maximum entropy" using types, by counting number of states a data type can represent. The maximum entropy of a function is a crossentropy between inputs and outputs (treating the function like a communication channel).
The "difference" (I am not sure how to make them convertible) between "maximum entropy" and "function entropy" (size in bits) then shows how good your understanding (compared to specification expressed in type signature) of the function is.
I have been advocating for some time that we use entropy measures (and information theory) in SW engineering to do estimation of complexity (and thus time required for a change).
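One rough, practical proxy for these entropy measures is compressed size, and conditioning on a reference corpus of "standard" idioms approximates the cross-entropy idea. The snippets and reference corpus below are toy examples, not a calibrated measure:

```python
import zlib

def compressed_bits(text: str) -> int:
    # Rough entropy proxy: bits of DEFLATE-compressed text. True
    # Kolmogorov-style entropy is uncomputable; compression is the
    # standard practical stand-in.
    return 8 * len(zlib.compress(text.encode(), 9))

def conditional_bits(reference: str, code: str) -> int:
    # Approximate cross-entropy idea: extra bits needed for `code` once
    # the compressor has already seen the "standard" idioms in `reference`.
    return compressed_bits(reference + code) - compressed_bits(reference)

# Toy snippets: an idiomatic one-liner vs. a hand-rolled loop.
idiomatic = "total = sum(xs)\n"
handrolled = "total = 0\nfor x in xs:\n    total = total + x\n"
reference = "sum(xs)\nfor x in xs:\n" * 3  # stand-in for known abstractions

print(compressed_bits(idiomatic), compressed_bits(handrolled))
print(conditional_bits(reference, idiomatic),
      conditional_bits(reference, handrolled))
```

The hand-rolled version costs more bits both absolutely and conditionally, which is the direction the conformance-to-standard-abstractions argument predicts.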
Do you have any examples or resources that worked well for you?
If I have some time, the last thing I want to do with it is sharpen prompting skills. I can't imagine a worse or more boring use of my time on a computer or a skill I want less.
Every time I visit Hacker News I become more certain that I want nothing to do with either the future the enthusiasts think awaits us or the present that they think is building towards it.
The problem is that I don't know what I'll achieve manually before attempting the task.
OH! Yeah I think this is the exact bad feeling I've gotten whenever I've tried testing these things before, except without clear and useful feedback like compiler error messages or something. I remember when I used to code/learn like that early on and...it's not fun now. I also don't think it's really solvable
I wonder if these RL runs can extend over multiple sequential evaluations, where poor design in an early task hampers performance later on, as measured by amount of tokens required to add new functionality without breaking existing functionality.
I can’t speak for all humans, but I tend to code “nonlinearly”, jumping back and forth and typically going from high level (signatures, type definitions) to low level (fill in function bodies). I also do a lot of deletion as I decide that actually one function isn’t needed or if I find a simpler way to phrase a particular section.
Edit: in fact thinking on this more, code is _much_ closer to a tree than sequence of tokens. Not sure what to do with that, except maybe to try a tree based generator which iteratively adds child nodes.
I’ve been building out internal linters that enforce design patterns I want and flag common code smells (note also that tools like eslint allow custom rules, which are easy to write with something like opus 4.6). The use case is a total refactor of React and FastAPI apps. We are suffering from everything’s-a-snowflake syndrome and just want the same pattern employed across features.
This works pretty well when the linter has a companion agents.md file which explains the architecture and our way of working.
But to get the agent (Claude code opus 4.6 currently) to nail the directory structure and design primitives, and limit some doofus behavior, I still haven’t cracked how to make literally each line of code simple and sensible. And I haven’t figured out how to prevent agents from going out of bounds and doing weird things unless I catch it in review and add another rule.
This is a relatively new endeavor, but my gut is that it’s not much more time (linter rules and perhaps “evals” or a beefy agent review cycle) before I have bespoke linters in place that force what I want from our architecture.
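On the FastAPI side, the analogue of a custom eslint rule is a short visitor over Python's stdlib `ast`. This is a made-up example rule (flagging bare `except:` clauses), not one of the parent's actual linters, but it shows the shape of the approach:

```python
import ast

# Hypothetical custom lint rule: flag bare `except:` clauses, a common
# agent-generated smell that silently swallows every error.
class BareExceptChecker(ast.NodeVisitor):
    def __init__(self):
        self.findings = []

    def visit_ExceptHandler(self, node):
        if node.type is None:  # `except:` with no exception class named
            self.findings.append((node.lineno, "bare except swallows all errors"))
        self.generic_visit(node)

def lint(source: str):
    checker = BareExceptChecker()
    checker.visit(ast.parse(source))
    return checker.findings

snippet = (
    "try:\n"
    "    risky()\n"
    "except:\n"
    "    pass\n"
)
print(lint(snippet))  # flags line 3
```

Wiring something like this into CI over changed files is what turns a review-time nitpick into an automatically enforced rule the agent cannot ignore.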
Note that a huge bottleneck to all of this is that the codebase our current team inherited has no tests. It’s too easy to accidentally nuke a screen’s subtle details. It’s also really hard to write good tests without knowing what all of the functionality is. It feels like a blocker to a lot of large-swath agentic changes is a test strategy or solution first then a rigid push for rearchitecture or new design.
having linters is super important IMO - I never try to make the AI do a linter's job. let the AI focus on the hard stuff - architecture, maintainability, cleanliness, and the linter can handle the boring pieces.
I also definitely see the AI making changes that are way larger than necessary. I try to capture that in the eval by comparing a "footprint risk" which is essentially how many unnecessary changes did the AI make vs the original PR.
I would certainly like to move beyond using PRs as a sole source of truth, since humans don't always write great code either. Maybe having LLM-as-a-judge looking for scope creep/bloat would be a decent band-aid?
Writing prompts and writing code takes about the same amount of time, for the same amount of text, plus there's the extra time that the LLM takes to accomplish the task, and review time afterwards. So you might as well just write the code yourself if you have to specify every tiny implementation detail in the prompt.
Using this particular example: if you simply paste the exact code into the prompt, the model should be able to reproduce it. Now you can start removing bits and see how much you can take out of the prompt, e.g., simplify it to pseudocode. Then you can push further and move from the pseudocode to the architecture, and so on.
That way, you'll start from something that's working and work backwards rather than trying to get there in the absence of a clear path.
It is important to start a new chat so the model is not stuck in its previous mindset, and it is beneficial to have tests to verify that the simplified code still works as it did before.
Telling the model to generate concise code did not work for me, because LLMs do not know beforehand what they are going to write, so they are rarely able to refactor existing code to break out common functionality into reusable functions. We might get there eventually. Thinking models are a bit better at it. But we are not quite there yet.
I'm still struggling to move past the magic trick of guessing which characters come next. How does that amount to understanding "how", or imply understanding at all?
For example, I am developing a game using GDScript, and LLMs (including codex and claude) keep making scripts with no classnames and then loading them with @preload. I hate this, and it's explicitly mentioned in my godot-development skill. But what agents can't stand is a failing test. It feels a bit like enforcing rules automatically.
This is a stupid idea but it works wonders on giving taste to my LLM. I wonder if I should open source that test suite for other agentic developers.
One thing that is fairly low effort that you could try is find code you really like and ask the model to list the adjectives and attributes that that code exhibits. Then try them in a prompt.
With LLMs generally you want to adjust the behavior at the macro level by setting things like beliefs and values, vs at the micro level by making "rules".
By understanding how the model maps the aspects that you like about the code to language, that should give you some shorthand phrases that give you a lot of behavioral leverage.
Edit: Better yet.. give a fresh context window the "before" and "after" and have it provide you with contrasting values, adjectives, etc.
You need to think about what "good taste" is to you (or find others who have already written about software architecture and take their ideas that you like). People disagree on what that even means (e.g. some people love Rails; to me a lot of it seems like the exact opposite of "good taste").
A lot of prompts about finding the right level of abstraction, DRY, etc.
An earlier example (Opus 4.5 + Gemini 3 Pro) is here: https://github.com/stared/sc2-balance-timeline
I also tried just using Gemini 3 Pro (maybe it's the model, maybe the harness); it was not nearly as good at writing, but way better at refining.
They also just... ignore shit. I have explicit rules in the repo I'm using an agent on right now that say it is for planning and research only, and that unless asked specifically it should not generate any code. It still tries to generate code 2 or 3 times a session.
[1] https://big-stupid-jellyfish.github.io/GFMath/pages/llm-quan...
A guy with a mug comes up to a person standing with their laptop on a small table. The mug guy says, "Some day we won't even need coders any more. We'll be able to just write the specification and the program will write itself."
Guy with laptop looks up. "Oh, wow, you're right! We'll be able to write a comprehensive and precise spec and bam, we won't need programmers any more!"
Guy with mug takes a sip. "Exactly!"
Guy with laptop says, "And do you know the industry term for a project specification that is comprehensive and precise enough to generate a program?"
"Uh... no..."
"Code. It's called code."
I strongly assume the long tail is shifting and expanding now and will eventually mostly be software for one-off purposes authored by people who don't know how to code, and probably have a poor understanding of how it actually works.
But for the most part, it’s spending more tokens on analysis and planning than pure code output, and that’s where these problems need to be caught.
These statements are silly, because the only interesting comparison is among models with highly comparable on-disk sizes, or comparable active-parameter sizes. Obviously, a Q4 model is not going to be as effective as a Q6 of the same model; no one sensibly expects that. You need to compare the Q4 against a smaller model at higher precision. (The GP has the same problem, of course.) I believe that once you do that kind of comparison, more aggressive quantization of a larger model tends to win down to Q2 or so for casual chat, with perhaps slightly more bits per param needed for agentic use cases where avoiding erratic behavior is important.
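The size math behind that matching is trivial but worth making explicit. The parameter counts below are illustrative, not any specific model, and real quant formats add metadata overhead on top:

```python
# Back-of-envelope on-disk size: params * bits_per_weight / 8 bytes.
# Illustrative numbers only; real quant formats store per-block scale
# metadata too, so treat these as lower bounds.
def size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(size_gb(70, 4.0))  # 35.0 GB
print(size_gb(70, 6.0))  # 52.5 GB -> not the fair rival for a Q4 70B
print(size_gb(34, 8.0))  # 34.0 GB -> this is the comparable footprint
```

So the apples-to-apples question is "70B at Q4 vs. ~34B at Q8", not "70B at Q4 vs. 70B at Q6".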
Of course there are some systems where correctness is vital, and for those I'd like a precise spec and proof of correctness. But I think there's a huge bulk of code where formal specification impedes what should be a process of learning and adapting.
I guess in some sense this is already the case. Most developers are not "full stack" (and the job postings that describe a software MacGyver are ridiculed like clockwork), but with AI this is actually becoming more and more possible (and thus normal, or at least normalized). And of course software is eating the world, including itself, so the common problems are all SaaS-ified (and/or FOSS-ified), allowing AI-aided development to offload the instrumental dependencies.
?
This terse error was found to be necessary so as not to overwhelm the user with pages and pages of decision trees enumerating the ambiguities. The problem is that it doesn't work too well for the meso-structure.
Models tend to be quite good at the micro-structure because they've seen a lot of it already, and the macro-structure can easily be prompted, but the levels in between are what distinguish a good model (or human!) from a bad one.
https://medium.com/javascript-scene/sudolang-a-powerful-pseu...
The UX is there, for small things it does work for me, but there is still something left for LLMs to truly capture major issues.
Summary: We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would not be merged into main by repo maintainers, even after adjusting for noise in maintainer merge decisions. Since the agents are not given a chance to iterate on their solution in response to feedback the way a human developer would, we do not claim that this represents a fundamental capability limitation. Rather, our results indicate that a naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elicitation or human feedback.
It is often unclear how to translate benchmark scores into real-world usefulness. For example, if a model’s SWE-bench Verified score is 60%, does that mean it can resolve 60% of real-world open-source issues? One reason to doubt this is that benchmarks are clean and verifiable in ways the real world is not. To study this quantitatively, we take SWE-bench Verified and zoom in on one such difference — it uses an automated grader rather than the real-world standard of maintainer review.
To study how agent success on benchmark tasks relates to real-world usefulness, we had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests (PRs). We had maintainers (hypothetically) accept or request changes for patches as well as provide the core reason they were requesting changes: core functionality failure, patch breaks other code or code quality issues.
To deal with noise in maintainer decisions, we also record maintainer merge decisions on 47 original human-written PRs that were actually merged into main (hereafter “golden patches”). We report all our scores as % of golden baseline, e.g., as the golden baseline is 68%, if a model gets 34%, then the golden-baseline-adjusted score is 50%.1
Figure 1 shows our main result that on average maintainer merge decisions are about 24 percentage points lower than SWE-bench scores supplied by the automated grader. Moreover, the rate of improvement, as measured by percentage points gained per year (pp/yr), is 9.6 pp/yr slower for maintainer merge decisions. However, we believe our results on the rate of improvement are shakier and only provide suggestive evidence that it is slower for maintainer merge decisions.

Figure 1: Pass rates normalized as a percentage of the golden baseline, where 100% represents golden patch performance. SWE-bench Automated Grader (orange) records the percentage of patches that pass the automated grader, divided by the golden baseline (100%) and then converted back into a percent. Maintainer Merge (blue) records percent of patches that are merged by maintainers, divided by the golden baseline (68%) and then converted back into a percent. Error bars indicate 95% confidence intervals.
Our central claim is about comparing a naive interpretation of benchmark scores with a richer view of agent usefulness. It is important to note what we are not doing:
In a previous blog post, we compared a small number of PRs that passed algorithmic tests vs. our own code review. This investigation makes 5 advances over that post:
While we believe these advances make our current results more reliable, they are not strictly comparable to the results in our previous post. In particular, the previous post considered much more difficult PRs. In our sample of SWE-Bench Verified golden patches, the average lines of code changed is about 17 while for our previous post this was over 500.
Our basic methodology is to take PRs that are scored as correct by the SWE-bench automated grader, and have current maintainers review whether the patches would be merged into main.
We recruited 4 active maintainers covering 3 SWE-bench Verified repos: 2 from scikit-learn, 1 from Sphinx, and 1 from pytest. Therefore, we have coverage for 3/12 (25%) of the repos in SWE-bench Verified, covering 95/500 (19%) of the issues. Maintainers were recruited via cold email. In the appendix, we show that our sample is representative of SWE-bench Verified when measured by pass rates. Maintainers are paid hourly with a bonus if peers label their reviews as high quality.
We use the SWE-bench Verified agent runs from Epoch’s benchmarking hub (Epoch AI, 2025).2 We pull patches from the following:
We focus primarily on Anthropic models because they have been state-of-the-art on SWE-bench Verified for most of its history.
We use patches in their original state, except we manually remove miscellaneous debugging files to avoid penalizing agents for artifacts they were not instructed to remove prior to submission.3 We upload the patches to a private copy of the repo in its historical state on GitHub.
For our main analysis, we only submit PRs for review if they pass the SWE-bench automated grader. We impute maintainer review as failure if the PR fails the automated grader. This amounts to assuming SWE-bench Verified has no false negatives (cases where the automated grader says fail, but the maintainer would actually merge it). We think this is a fairly reasonable assumption because:
The appendix has results without this assumption and results that try to relax it.
For AI-generated patches, maintainers review all of the patches that pass the SWE-bench automated grader.5 For the golden patches, we have maintainers review only half of them (randomly selected) to save time. In the case of scikit-learn, where we have more than one maintainer, we randomly split which patches are reviewed by which maintainer.
Maintainer review is done in GitHub, to match real code review as closely as possible. The review is done in waves. Each wave contains non-overlapping issues attempted by different models. The maintainers are not told the source of the pull request.6
We ask maintainers to review these PRs exactly as they do PRs in real life, with two exceptions:
We ask the maintainers to give an accept/request changes decision, structured feedback, and natural language feedback. The structured feedback covers:
Finally, maintainer reviews may be noisy, as they involve subjective decisions on code quality, etc. Therefore, we benchmark agent pass rates vs. the level of noise in the golden patches. Recall that SWE-bench Verified is made from real-world PRs that were merged. We take these real-world PRs and re-submit them to our maintainers to establish a baseline for noise in our maintainer merge pipeline.7 We found that about 68% of golden patches are indeed merged by our maintainers, and the average progress towards a mergeable PR among all golden patches is about 90%. Moreover, around 85% of golden patches are rated as making 80% or more progress towards a mergeable PR. The percentage merged is low, but the percentage rated as making 80% or more progress toward a mergeable PR is high, indicating that there is a degree of maintainer subjectivity playing a role in the last mile of PR acceptance. We report all our scores as % of golden baseline accordingly, e.g., if the golden baseline is 68% and a model gets 34%, then the golden-baseline-adjusted score is 50%.
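The adjustment itself is a simple ratio; reproducing the worked example from the text:

```python
# Golden-baseline adjustment: report a model's maintainer-merge rate
# relative to the 68% rate at which human "golden" patches are re-merged.
def adjusted(model_rate: float, golden_baseline: float = 0.68) -> float:
    return model_rate / golden_baseline

print(f"{adjusted(0.34):.0%}")  # 50%: the example above
```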
All confidence intervals are 95% and computed via nonparametric bootstrap at the patch level. For trend lines, we fit an unweighted linear regression of pass rates against model release dates. Trend confidence intervals reflect only uncertainty in pass rates, conditioning on which models are in the sample. We do not account for the uncertainty in the golden baseline.
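A minimal sketch of the patch-level nonparametric bootstrap described here, run on invented 0/1 merge decisions (the counts are arbitrary, not the study's data):

```python
import random

random.seed(1)

# Hypothetical patch-level outcomes: 1 = maintainer merged, 0 = rejected.
outcomes = [1] * 34 + [0] * 61  # e.g. 34 merges out of 95 patches

def bootstrap_ci(data, n_boot=10_000, alpha=0.05):
    means = []
    for _ in range(n_boot):
        resample = [random.choice(data) for _ in data]  # sample w/ replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2))]
    return lo, hi

rate = sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)
print(f"merge rate = {rate:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Resampling at the patch level (rather than, say, per repo) is what makes the intervals reflect patch-to-patch noise, matching the error bars in the figures.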
Figure 1 plots the pass rate for our language models according to the SWE-bench automated grader and according to maintainers. There are two immediate conclusions from this graph.
First, the maintainer merge rate is well below the SWE-bench pass rate. In the bottom right, the top row shows the automated grader is on average about 24.2 percentage points (standard error: 2.7) higher than the maintainer merge decision. This is strong and statistically significant evidence that the naive interpretation of the benchmark as passing maintainer review is misleading.
Second, at the bottom of Figure 1, we show the difference in percentage points gained per year (pp/yr). Notice that the maintainer merge decision is gaining about 9.6 pp/yr (standard error: 5.5) less than the automated grader. This is a weaker effect than the level difference; it is only statistically significant at a 10% significance level rather than 5%. Moreover, the pp/yr result is less robust than the level result, e.g., the result changes if trend lines are filtered to SOTA models only, so we take this as only weak evidence that the rate of improvement is slower for maintainer merge decisions.
Next, as a robustness check, Figure 2 changes the pass criterion from maintainers merging the PR to maintainers saying the PR makes sufficient progress towards a mergeable PR. Concretely, we define a 0–1 pass as maintainers rating that the PR is more than 80% of the way towards a mergeable PR, continuing to impute no pass if the automated grader fails the patch. Note that the 80% progress cutoff is the 15th percentile of the golden patches, so this is a fairly low bar. We find very similar results: scores are roughly half, and the trend is about 10 pp/yr slower.

Figure 2: Pass rates normalized as a percentage of the golden baseline, where 100% represents golden patch performance. SWE-bench Automated Grader (orange) records the percentage of patches that pass the automated grader, divided by the golden baseline (100%) and then converted back into a percent. Progress >= 80% (blue) records percent of patches that maintainers said the progress the patch makes towards merging is greater than 80%. This is divided by the golden baseline and then converted back into a percent. Error bars indicate 95% confidence intervals.
To understand why some PRs passed the automated grader but failed maintainer review, Figure 3 breaks out failures by their primary reason, from least serious to most serious problem:

Figure 3: Stacked bars report the distribution of outcomes for each model (bars sum to 100%). Pass denotes patches that passed the automated grader and were accepted by repository maintainers. Automated Grader denotes patches rejected by the automated grader. The remaining segments break down maintainer rejections into Code Quality, Other (uncategorized), Breaks Other Code, and Core Functionality. These are roughly in the order of severity of the issue.
This figure provides more texture to our understanding of AI progress on SWE-bench Verified:
Figures 4 and 5 show example rejections for code quality. Figures 6 and 7 show example rejections for core functionality and breaking other code respectively.

Figure 4: Maintainer Review, Example 1

Figure 5: Maintainer Review, Example 2

Figure 6: Maintainer Review, Example 3

Figure 7: Maintainer Review, Example 4
Our study has many technical limitations, including:
In early 2025, METR published results showing that open-source developers were slowed down when they used AI. This was surprising to us and other experts, at least in part because SOTA models at the time could autonomously complete 40-52% of issues in SWE-bench Verified. The gap between the automated SWE-bench Verified grader and maintainer review perhaps makes this slowdown slightly less surprising.
In general, these results caution against naive extrapolation of some benchmarks to real-world usefulness. The results above focus on SWE-bench Verified, but we suspect similar lessons apply to other benchmarks interpreted in the context of human workflows, e.g., GDPval-AA, UpBench, etc.
The main results impute maintainer failure for patches that fail the automated grader. An alternative is to look at just the false positive rate: among patches that *pass* the automated grader, what share do maintainers actually merge? This avoids any assumption about false negatives. Figure 1 shows that AI-generated patches that pass the automated grader are merged at a lower rate than human golden patches, though we caution against direct AI-vs-human comparisons given the different conditions under which these patches were produced.
[Bar chart: conditional maintainer pass rate (%), 0–100, for Claude 3.5 Sonnet (Old) (n=30), Claude Sonnet 3.7 (n=51), Claude Opus 4 (n=60), GPT-5 (n=61), and Claude Sonnet 4.5 (n=63), with the Golden Patch Rate shown for reference.]
Figure 1: Conditional Maintainer Merge Rate Among Automated-Grader-Passing Patches
Notes: This plots the share of automated grader passing patches that are merged by repo maintainers. The Golden Patch Rate shows the human baseline as a horizontal dashed line. Error bars denote 95% confidence intervals.
Our maintainer sample covers 3 of 12 SWE-bench Verified repos (scikit-learn, Sphinx, pytest), comprising 95 of 500 issues. Figure 2 compares the automated grader pass rates on our subset against the full dataset, showing close alignment and suggesting our subset is not biased by task difficulty.
[Figure: automated grader unit test pass rates (%) for Claude 3.5 Sonnet (Old), Claude Sonnet 3.7, Claude Opus 4, GPT-5, and Claude Sonnet 4.5, comparing our sample (scikit-learn, Sphinx, pytest) with all of SWE-bench Verified.]
Figure 2: SWE-bench Automated Grader Scores: Our Maintainer Subset vs. Full Dataset
Notes: For each model, we compute pass rates separately for our sample (n=95) and all SWE-bench Verified tasks (n=500), with error bars indicating 95% confidence intervals based on standard errors.
Our main analysis assumes all patches that fail the automated grader would also fail maintainer review (i.e., no false negatives). To test robustness, we estimate the grader's false-negative rate from a small subset of 31 patches covering 27 issues where maintainers reviewed automated-grader-failing patches. We found only 1/27 (3.7%) issues where the automated grader rejected valid patches. Figure 3 shows that correcting for this estimated false-negative rate yields similar results.
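The correction described above is simple arithmetic: the observed merge rate among auto-grader passes, plus the estimated share of auto-grader failures that maintainers would nevertheless approve. A sketch with made-up rates (the function name and example numbers are ours; only the 1/27 pooled false-negative estimate comes from the text):

```python
def fn_corrected_rate(auto_pass: float, merge_given_pass: float,
                      fn_rate: float) -> float:
    """Maintainer pass rate corrected for automated-grader false
    negatives: merged patches among auto-grader passes, plus the
    estimated share of auto-grader failures maintainers would
    still approve."""
    return auto_pass * merge_given_pass + (1 - auto_pass) * fn_rate

# hypothetical rates; the pooled false-negative estimate is 1/27
corrected = fn_corrected_rate(0.60, 0.50, 1 / 27)
```

Because the estimated false-negative rate is small, the correction adds only about a percentage point or two, consistent with Figure 3's similar results.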

Figure 3: Maintainer Pass Rates with Estimated False Negatives
Notes: The imputed approach (blue) assumes all patches that fail the automated grader would be rejected by maintainers. The false negative corrected approach (teal) adjusts for the observed rate at which maintainers approve patches despite failing the automated grader. Error bars show 95% confidence intervals. We use a pooled false-negative rate across all models because our sample is too small to estimate model-specific rates.
The main results normalize by the golden baseline (68% maintainer merge rate on human patches). Figure 4 shows raw, unnormalized pass rates. Under this specification, maintainer merge rates are roughly one-third to one-half of automated grader pass rates, and the difference in the rate of improvement is 15.5 pp/yr.
[Figure: raw automated grader vs. maintainer merge pass rates (%) by model release date (2024-07 to 2025-09), with golden baselines for each. Difference (Automated Grader − Maintainer Merge): average 34.9 pp (SE: 2.2); trend 15.5 pp/yr (SE: 4.6).]
Figure 4: Raw (Unnormalized) Automated Grader vs. Maintainer Pass Rates Over Time
Notes: SWE-bench Automated Grader (orange) records the percentage of patches that pass the automated grader. Maintainer Merge (blue) records the percentage of patches merged by maintainers. Error bars indicate 95% confidence intervals.
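The normalization used in the main results is a one-line transformation of these raw rates; a sketch (the helper name is ours, the 68% baseline is from the text):

```python
def normalized_rate(raw_merge_rate: float,
                    golden_baseline: float = 0.68) -> float:
    """Express a raw maintainer merge rate as a share of the 68%
    human golden-patch merge rate used in the main results."""
    return raw_merge_rate / golden_baseline

# e.g., a 34% raw merge rate is 50% of the golden baseline
half_of_baseline = normalized_rate(0.34)
```

This normalization matters for interpretation: since even human golden patches are merged only 68% of the time, a raw merge rate understates model performance relative to the human bar.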
Figure 5 shows the same unnormalized view using our alternative measure of PR success: maintainers rating the patch's progress toward a mergeable PR at 80% or higher.
[Figure: raw automated grader vs. maintainer progress-based pass rates (%) by model release date, with golden baselines. Difference (Automated Grader − Progress ≥ 80%): average 23.8 pp (SE: 2.0); trend 12.3 pp/yr (SE: 4.0).]
Figure 5: Raw (Unnormalized) Progress-Based Pass Rates
Notes: SWE-bench Automated Grader (orange) records the percentage of patches that pass the automated grader. Progress ≥ 80% (blue) records the percentage of patches for which maintainers rated progress toward merging at 80% or higher. Error bars indicate 95% confidence intervals.
The trend result (widening gap over time) is sensitive to which models are included. Figure 6 restricts the sample to models that were state-of-the-art at their time of release. The trend difference is smaller and not statistically significant under this restriction, illustrating the fragility of the trend finding.
[Figure: golden-baseline-adjusted pass rates by model release date, SOTA models only (GPT-5 excluded). Difference (Automated Grader − Maintainer Merge): average 20.6 pp (SE: 3.0); trend 4.1 pp/yr (SE: 12.1).]
Figure 6: Normalized Pass Rates Over Time (SOTA Models Only)
Notes: Restricted to state-of-the-art models at their time of release. Pass rates are normalized as a percentage of the golden (human) baseline.
Figures 7–9 show results separately for each repo. The level gap between automated grader and maintainer pass rates holds across all three repos. The trend result holds for Sphinx and pytest but is noisy and does not hold for scikit-learn.
[Figure: golden-baseline-adjusted pass rates by model release date, scikit-learn only. Difference (Automated Grader − Maintainer Merge): average 18.3 pp (SE: 4.6); trend −12.1 pp/yr (SE: 10.0).]
Figure 7: Automated Grader vs. Maintainer Pass Rates: scikit-learn
Notes: Issues from the scikit-learn repo only. Error bars indicate 95% confidence intervals.
[Figure: golden-baseline-adjusted pass rates by model release date, Sphinx only. Difference (Automated Grader − Maintainer Merge): average 23.3 pp (SE: 4.1); trend 21.4 pp/yr (SE: 7.6).]
Figure 8: Automated Grader vs. Maintainer Pass Rates: Sphinx
Notes: Issues from the Sphinx repo only. Error bars indicate 95% confidence intervals.
[Figure: golden-baseline-adjusted pass rates by model release date, pytest only. Difference (Automated Grader − Maintainer Merge): average 43.1 pp (SE: 5.2); trend 23.7 pp/yr (SE: 10.5).]
Figure 9: Automated Grader vs. Maintainer Pass Rates: pytest
Notes: Issues from the pytest repo only. Error bars indicate 95% confidence intervals.
Maintainers reviewed the same issue multiple times across different models, which could bias assessments if earlier reviews influence later ones. To test for this, we construct a PR-level dataset that records which wave each PR was in. Recall that PRs are reviewed in waves, where each wave includes a given issue at most once. We regress the maintainer merge decision on wave number, controlling for model and issue fixed effects. If seeing an issue multiple times made maintainers harsher, the coefficient on wave would be negative; if it made them more lenient, positive. Table 1 shows no evidence of ordering effects: the wave coefficient is 0.016 (SE: 0.037, p = 0.67) and the partial R² for wave is about 0.001, indicating wave order explains very little of the merge decisions.
Table 1: Linear Regression of Merge Decision on Wave Order
| Variable | Coefficient | P-value |
|---|---|---|
| Wave | 0.016 (0.037) | 0.665 |
| Model FE | Yes | |
| Issue FE | Yes | |
| N | 329 | |
| Partial R² (Wave) | 0.0008 | |
Notes: Heteroskedasticity-robust standard errors in parentheses. The dependent variable is whether the maintainer merged the PR (0/1). N=329 reflects only patches that passed the automated grader and were reviewed by maintainers, excluding imputed failures.
We apply the time horizon methodology from Kwa et al. (2025), which converts benchmark scores into a “time horizon”—the human completion time of tasks at which a model achieves 50% success. We use the time-to-complete estimates from SWE-bench directly, though we do not have task families or task weighting as in the original methodology. Note that time-to-complete estimates from SWE-bench come in ranges (e.g., 15min–1hr); we take the geometric mean to get a point estimate.
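Collapsing a duration bucket to a point estimate via the geometric mean is a one-liner; a sketch (the function name is ours):

```python
import math

def bucket_midpoint_minutes(low_min: float, high_min: float) -> float:
    """Geometric mean of a time-range bucket, i.e. its midpoint on a
    log scale (e.g. the 15min-1hr bucket maps to 30 minutes)."""
    return math.sqrt(low_min * high_min)

mid = bucket_midpoint_minutes(15, 60)  # → 30.0
```

The geometric mean (rather than the arithmetic mean, which would give 37.5) is the natural midpoint here because the downstream regression works on log task duration.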
Methodology. For each evaluation criterion (automated grader and maintainer merge), we fit logistic regressions of pass rate against log task duration. The time horizon T₅₀ is the task duration at which the fitted model predicts 50% success. Note that these estimates are less stable than in Kwa et al. because the task duration range in SWE-bench Verified is narrower, requiring extrapolation away from the data to estimate a 50% time horizon.
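Given fitted coefficients for logit p = b0 + b1·ln(t), the 50% time horizon has a closed form, since logit(0.5) = 0. A sketch with illustrative coefficients (our own, not the study's fitted values):

```python
import math

def time_horizon_50(intercept: float, slope: float) -> float:
    """Task duration at which a logistic fit of success probability
    on log task duration predicts 50% success: solve
    intercept + slope * ln(t) = 0 for t."""
    return math.exp(-intercept / slope)

# illustrative coefficients, not the study's fitted values
t50_minutes = time_horizon_50(3.9, -1.0)  # ≈ 49.4 minutes
```

Note how sensitive this is to the slope: when the fitted slope is shallow (as with narrow task-duration dispersion), small coefficient changes move T₅₀ a lot, which is the instability flagged above.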
Results. The automated grader gives substantially higher time horizons than maintainer review. For example, Claude Sonnet 4.5 has an approximately 50-minute time horizon according to the automated grader but only about an 8-minute time horizon according to maintainers—roughly a 7× overstatement. This level difference is our most robust finding from the time horizon analysis.
[Figure: per-model logistic fits of P(success) vs. human task time (1m–16h), automated grader. 50% time horizons — claude-sonnet-3.5: 4 min; claude-sonnet-3.7: 31 min; claude-opus-4: 35 min; gpt-5: 39 min; claude-sonnet-4.5: 50 min.]
Figure 10: Time Horizon Estimates: SWE-bench Automated Grader
Notes: Logistic regression fits of pass rate against log task duration using SWE-bench automated grader scores. The time horizon is the task duration at which the fitted model predicts 50% success.
[Figure: per-model logistic fits of P(success) vs. human task time (1m–16h), maintainer merge decision. 50% time horizons — claude-sonnet-3.5: 2 min; claude-sonnet-3.7: 6 min; claude-opus-4: 4 min; gpt-5: 2 min; claude-sonnet-4.5: 8 min.]
Figure 11: Time Horizon Estimates: Maintainer Merge Decision
Notes: Logistic regression fits of pass rate against log task duration using maintainer merge decisions.
Trend. Figure 12 plots time horizons against model release dates. We want to heavily caveat this figure. The only robust conclusion is that the level of time horizon is overstated by a large amount when using the automated grader versus maintainer review. Although it would be tempting to also conclude that the trend is slower under maintainer review, this finding is extremely noisy. The confidence interval for the difference in doubling time ranges from −158 months to 2 months, which is so wide that it is hard to take seriously. Moreover, this result is not very robust to different design decisions. Because SWE-bench data has fairly minimal time horizon dispersion, drawing robust, sufficiently powered conclusions about the rate of improvement is difficult.
[Figure: 50% time horizons vs. model release date, automated grader vs. maintainer merge, with per-model task counts. Difference (Auto Grader − Maintainer): average 27.5 hrs (95% CI: 20.4–262.1); difference in doubling time: −9.8 mo (95% CI: −157.9–2.1).]
Figure 12: Time Horizons Over Model Release Date
Notes: Time horizon estimates plotted against model release date. SWE-bench Automated Grader (orange) and Maintainer Merge (blue). Error bars indicate 95% confidence intervals.
Summary. We find strong evidence that the level of time horizon is overstated by a large amount when using the automated grader versus maintainer review. However, we are unable to draw confident conclusions about the rate of improvement.
100% of golden patches pass the automated grader, so this adjustment makes no difference for scores given by the automated grader. ↩
Note we use runs taken prior to Epoch’s recent harness update. These scores tend to lag the scores reported by model providers, plausibly because the Epoch harness is less optimized for high benchmark scores. We guess that this affects the automated grader and maintainer scores roughly equally, but this is an interesting question for future research. ↩
For the first few patches, we included these files but instructed maintainers to ignore them. After that point, we manually edited the patches to exclude the additional files. ↩
Note that these 31 patches were not intentionally randomly selected, but rather occurred in the pilot and a secondary check. These patches are not included in the main analysis as they are treated as automated failures. ↩
About 2% of the Epoch patches were corrupted and excluded from our analysis. ↩
We do not claim they cannot figure out the source of the patch. In particular, golden patches were merged into the repo, so that could be a giveaway to the source (although we do not reveal that the human patches are golden). Moreover, AI-written code often makes certain mistakes or employs certain styles that could allow maintainers to guess the source of the patch. ↩
The maintainers are not told that these are the golden patches, only that some of the PRs that they will review are from humans. ↩