Once a benchmark is known and there's billion of dollars on the line, obviously every company will game them.
this statement alone seems to invalidate the SWE-bench tests
No shit, Sherlock!
1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.
2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsatured.
3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :)
ELT-Bench is another recent example. It was the first serious attempt at a benchmark for data engineering workloads, published about a year ago.
A few days ago, a follow-up paper from a group that includes one of the original authors audited the benchmark itself. The team gfound that the benchmark has structural issues that biased results.
Here’s the paper: https://arxiv.org/abs/2603.29399
None of these are new though, the industry has gone through all that before just in a smaller scale and there’s a lot to learn from that. Here’s a post I wrote on the parallels we see today to what happened with the benchmarketing wars of the database systems.
https://www.typedef.ai/blog/from-benchmarketing-to-benchmaxx...
The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed to not already be in the training data, and not borrow anything from previous benchmarks.
In this regard I don't think any benchmark that was created before a given model is released should ever be considered valid or representative of model performance. The potential financial gain for including the data just to be able to market a minor improvement is too swaying. With that in mind they should honestly just stop including benchmarks altogether in marketing material
Let the model speak for itself and let the community decide, but of course that will never slide with corporate types with so much money on the line.
I.e. A panel comes up with a series of problems.
Like advent of code or project Euler but more complex and constricted.
Benchmark outcomes could be performance points and measure of cost, time to solution (well token count really).
A couple times per year it's run.
It avoids overfitting.
Overtime the tasks can become more complex if needed.
If they benchmax it into being able to complete full products from spec and robust implementations amazing.
Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy.
Leaderboard: https://arcprize.org/leaderboard
(Most premier models don't even pass 5 percent.)
Is this saying a quarter* of the questions and answers were wrong, this whole time?!
If so, how was this ever, in any way, a valid measurement?
And what was the process for creating this benchmark and how did it end up with such an extraordinarily poor set of data? (There is a description later of how, which seems to be a high standard and I struggle to understand how it aligns with the other results they discuss.) Kudos to them for highlighting the issues, but I am left with questions.
[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!
That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.
I want a model that can detect the actual units/models that are placed on top of the terrain/board so I can track how the models move during the game, but trying gemini and chatgpt they were absolutely rubbish.
I mean, it's fine as it's useful for many people, but where is the button for disabling it ? Or why is it enabled by default ?
"codage de pointe" sounds so weird and cringe in French.
I have empirical experience though building classifiers that can have no precision measurement because the classifier performs invariably better than humans. They become the state of the art benchmark themselves and can’t be benchmarked except against themselves. These are for tasks that are non trivial and complex, but less logical than coding and less sustained reasoning. There may come a day though, when there is no calibrated benchmark that is independent of the models it’s measuring.
Which community are we talking about? The professionals with 10+ years experience using LLMs, the vibe coders that have no experience writing code and everyone in between? If you read some of the online communities the experiences with the models all over the place, some compare GPT 5.5 to the second coming of JC while others think it's stupider than 5.4.
I personally don't have time to build a set of private benchmarks to compare the models that are coming out so I'm mostly relying on private and semi-private benchmarks to get a feel for how models are improving before I subscribe to a service and start using it myself. At least it's something a bit more reliable than the vibes of random people and bots on reddit.
These benchmarks are always greenfield, but people want a model that can deal with a rotted context.
Opus otoh is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on gpt 5.5 to check its work.
The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will be after looking at this.
Grats on your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!
The other issue they mention is being overly constrained vs. what is asked for - such as requiring specific class or function names to pass that were not part of what was specified.
It might be possible that even to the extent they are not contaminated Claude is better at predicting what sort of function names would be used in the repository (this fits my experience in using it on a number of projects with very different styles - I've found it to be good at "when in Rome") - this is a laudable trait, but it's also not what SWEbench claims to be measuring.
So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
Is this just the next level of the "they're serving quantized models!" theory?
But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"
0.191 * 0.594 > 1 - 0.936
Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high answers through some shady means?
You need new datasets perpetually.
10 groups of 3 researchers, all have their own benchmarks that they do not share (testing it without the authors knowing is a different problem, maybe they only run the benchmarks when the gen-pop has access to the models).
that's 10 different tests. Aggregate pass rates
Obligatory XKCD: https://xkcd.com/937/
as long as theres a test framework, you could gauge success deterministically.
Further, olympiad style benchmarks are arguably easier to contaminate / memorize unless you refresh it regularly; but that goes for SWE-bench too.
It's not a crazy idea. Have the older model interview the newer one and then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of times both sides agree the newer model won is the score.
Hehe
No, they're saying 59.4% of the 27.6% subset had flawed test cases I think.
> If so, how was this ever, in any way, a valid measurement?
Benchmarks essentially aren't, for practical concerns anyways. They don't represent your use case, and they don't represent any and all use cases, they're valid for measuring exactly what's included in the benchmarks, nothing more and nothing less.
I don't understand the ecosystems obsession with using public benchmarks, they hardly ever tell you anything of value. Ok, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5, does that mean it'll be 50% better for what you're using it for? Very unlikely.
I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real, actual cases where a LLM went wrong and I had to adjust the prompt, and over time I've built up a suite.
Most of the times when a new update comes out to a model, it moves maybe 2-3% in my own benchmarks, meanwhile they tout 30-40% increase or something ridiculous in public benchmarks, and we're supposed to believe the models' training data isn't contaminated...
Most machine-learning benchmarks have a fairly large fraction of incorrect labels, but when you just want to distinguish between different models, the time you'd need to ensure perfect scoring would usually be better spent on collecting a larger benchmark dataset, even if it ends up having more errors.
They're saying they need to move on from it because the benchmark is flawed (without bringing in proof) and that's why they can't hit 100%.
It's not a "our models are so good that the benchmark is too easy" thing.
The LLMs I have tested have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year old with amazing vocabulary.
[1]: https://entropicthoughts.com/updated-llm-benchmark
(more descriptions available in earlier evaluations referenced from there)
It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.
The only real way to evaluate a model is to test it yourself but that's exhausting for each new model and not comprehensive anyway.
It will be interesting to see the implications of this. Tooling can only do so much in the long term.
The answer is “it works because ML wants to work.” It’s surprising how far you can get with something flawed. It’s also why such huge breakthroughs are possible by noting flaws others haven’t.
I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three version, I suddenly find that oder model to be dramatically worse. But was that older model always that quality, or am I now being served a different model under the same version name?
It's all just so opaque.
I do these sort of breakthroughs at home all the time! My wife would say the computer is doing something strange, and instead of just randomly clicking around, I read the error messages slowly and out loud, then follow what they say. Anyone can do this, yet it seems like a magical ability every time you employ it to help people.
So not one in four, but one in six problems have problems.
That is extraordinarily high and the point still stands: is this truly saying a [large proportion] of the questions and answers were wrong, this whole time, and if so how was it ever a valid measurement?
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.
Did we read the same article?
Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
But if that hypothesis is the explanation the interesting part is Opus 4.7 (but not 4.6) seems to be doing the same.
Regarding evaluation, I've found using tools like promptfoo (and in some cases custom tools built on top of that) are useful. These help when evaluating new models/versions and when modifying the system prompt to guide the model. Especially if you can define visualizations and assertions to accurately test what you are trying to achieve.
This can be difficult for tasks like summarization, code generation, or creative writing that don't have clear answers. Though having some basic evaluation metrics and test cases can still be useful, and being able to easily do side-by-side comparisons by hand.
Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely their solution are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, eg. in the context of backwards compatibility.
[1] https://learn.microsoft.com/en-us/troubleshoot/microsoft-365...
Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI’s Preparedness Framework. When we created the Verified benchmark initially, we attempted to solve issues in the original evaluation that made certain tasks impossible to accomplish in the SWE-bench dataset(opens in a new window).
After initial leaps, state-of-the-art progress on SWE-bench Verified has slowed, improving(opens in a new window) from 74.9% to 80.9% in the last 6 months. This raises the question: do the remaining failures reflect model limitations or properties of the dataset itself?
In a new analysis, we found two major issues with the Verified set that indicate the benchmark is no longer suitable for measuring progress on autonomous software engineering capabilities for frontier launches at today’s performance levels:
We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time. This is why we have stopped reporting SWE-bench Verified scores, and we recommend that other model developers do so too.
We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.
The original SWE-bench(opens in a new window) evaluation was released in 2023. Each problem is sourced from a resolved GitHub issue in one of 12 open-source Python repositories and paired with the corresponding pull request (PR). To determine whether a model-generated code change is correct, each problem comes with two sets of tests:
The model does not see the tests. It has to produce a code change given only the original issue text and the state of the repository before the fix. It passes a problem only if all tests pass after the code change is applied.
We found many issues with that evaluation that could lead to underreporting the capability of models.
We created SWE-bench Verified in 2024 to address these issues. We worked with expert software engineers to review 1,699 SWE-bench problems and filter out problems that had these issues. Each problem was reviewed by three experts independently. This review process resulted in SWE-bench Verified, a curated set of 500 problems.
While SWE-bench Verified is a big improvement over the initial version, residual issues remain. We conducted an audit of 138 SWE-bench Verified problems that OpenAI o3 did not consistently solve over 64 independent runs. Each case was independently reviewed by at least six experienced software engineers. If an expert flagged an issue, it was re-verified by an additional team.
We found that 59.4% of the 138 problems contained material issues in test design and/or problem description, rendering them extremely difficult or impossible even for the most capable model or human to solve.
An illustrative example of the first failure mode is pylint-dev__pylint-4551(opens in a new window), where the PR introduces a new function `get_annotation` as part of the overall solution. This function name is not mentioned in the problem description, but is imported directly by the tests. While some models might intuit to create such a function, it’s not strictly necessary to implement a function with this specific name to correctly address the problem. Many valid solutions fail the tests on import errors.
An example of too wide test cases is sympy__sympy-18199(opens in a new window). This task was sourced from a PR that addressed three distinct issues with the `nthroot_mod` function, specifically #17373(opens in a new window), #17377(opens in a new window), and #18212(opens in a new window). The description for the SWE-bench Verified task, however, covers only the final issue #18212(opens in a new window). This creates a mismatch: the PR tests cover all three issues, while the description details only one. In our runs, models often correctly implement the described fix and then fail tests that cover implementation for the other two issues.
SWE-bench Verified and the repositories (code bases and release notes) are both open-source and broadly used and discussed, which makes avoiding contamination difficult for model developers.
We first encountered signs of contamination in our own models. For example, when GPT‑5.2 solved 31 tasks we identified to be almost impossible to solve. In django__django-14725(opens in a new window) the tests require a specific new parameter `edit_only` which is not explicitly required by the problem statement. While solving the problem, GPT‑5.2 shows in its chain of thought that it has information about the release notes that detail changes to the codebase, and correctly identifies that the `edit_only` parameter was introduced in Django 4.1.
To assess how significant contamination is more broadly, we created an automated red-teaming setup. For each SWE-bench Verified question, we tasked GPT‑5 with probing a GPT‑5.2‑Chat, Claude Opus 4.5 and Gemini 3 Flash Preview for contamination. These models were chosen to exclude reasoning models, but we acknowledge there is likely a non-trivial capability gap between them.
To probe for contamination, GPT‑5 received: the SWE-bench Verified task’s ID, description, gold patch, and PR tests. Over 15 turns, we allowed GPT‑5 to vary the system/developer prompt, user prompt, and assistant prefill and different elicitation strategies. After each turn, a judge model labeled how much novel task-specific information appeared and each response was labeled for contamination severity from “none” to “strong.” GPT‑5 was allowed to adapt its strategy based on prior turns to iteratively recover task-specific details. For each example of strong contamination, we verified with another judge that GPT‑5 didn’t leak too much information to the target model. Finally, we then manually reviewed the “strong” examples that make up the transcripts in this post.
Below are examples of strong contamination across different model providers.
Given a short snippet from the task description, GPT‑5.2 outputs the exact gold patch. In particular, it knows the exact class and method name, and the new early return condition `if username is None or password is None` that is introduced.
Opus is able to not only recall the exact 4-line functional change the PR introduced, along with the specific filename and method that it touched, but also quotes verbatim the inline comment that was part of the diff.
Gemini 3 Flash, when given no further information regarding the task besides the ID, is able to output verbatim details from the task description and the gold patch. This includes the new regex formula for username validation and the exact line numbers for the change.
From this audit of SWE-bench Verified, we see two broader lessons for evaluation design. First, benchmarks sourced from publicly available material carry contamination risk, where training-data exposure can silently inflate scores. If publicly crawled data is used in benchmark construction, model developers should perform additional tests for contamination. Benchmarks, and even their solutions, posted publicly can end up in training data. Extra care should be taken both in how datasets are posted (i.e. password protected) and training data filtering (i.e. strict adherence to canary strings).
Second, automated scoring is tricky to get right; perfect test cases should fully verify correct functionality, being both agnostic to specific unimportant implementation details and also robust to shortcut solutions. These problems are inherently complex and difficult to solve. Catching these problems took multiple extensive human labeling campaigns.
We have incorporated these findings into our recent evaluation efforts. In the last months we’ve chosen to report results from the public split of SWE-Bench Pro. We recommend other model developers do the same. SWE-bench Pro is not perfect, but empirically seems to suffer less from contamination issues. Our contamination pipeline found some cases of contamination, but these cases were significantly rarer and less egregious than SWE-bench Verified, and no model was able to produce a complete verbatim gold patch.
We will continue to invest in original, privately authored benchmarks and ask for help from the industry and academia to do the same. In GDPVal, tasks are privately authored by domain experts, reducing exposure risk, and solutions are graded holistically by trained reviewers. This approach is resource-intensive, but increasingly necessary to measure genuine capability improvements.