CVE-Bench: testing LLM agents on real-world vulnerability patches


Correction (2026-05-28): Five security tests in the original benchmark were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions in this post have been updated.

TL;DR: I evaluated five frontier models (three OpenAI, two Poolside) on fixing 20 real CVEs across three prompt types: full advisory, behavioral description only, and file+function location only. No model reliably fixes real vulnerabilities: the best solve rate (gpt-5.5) is 50% overall and 60% under the most favorable condition (full advisory). All four cross-family pairwise comparisons reach statistical significance under McNemar with continuity correction (p ≤ 0.040); within-family comparisons do not. The failure modes (wrong-search drift, budget exhaustion, partial fixes) are structured and repeatable. Token cost varies by 4× for equivalent outcomes. The locate condition, i.e., fixing code without a description of the flaw, is the sharpest instrument, and every model weakens there.

In early 2026, Anthropic claimed Mythos – one of their latest models – finds security vulnerabilities better than human experts. Yet, the number of security vulnerabilities keeps rising anyway.

I wanted to test how well models do at fixing vulnerabilities. Poolside’s Laguna models arrived this year, and I was looking for a real environment to put them through. SWE-Bench, the default benchmark, tests general coding ability; I wanted something with sharper stakes.

So, I thought, why not create a benchmark specifically for real-world security? That’s CVE-Bench. Twenty real-world CVEs, five models, three prompt conditions. Each agent runs in a sandboxed container and is scored against the maintainer’s security tests (with some adaptations).

Hopefully, benchmarks like this one will help the community fix these issues before they can be exploited.

The anatomy of security vulnerabilities

When a security researcher finds a vulnerability, they follow responsible disclosure: contact the maintainers privately with an advisory, a structured description of the flaw, and coordinate a fix before going public. A CVE identifier is assigned and the advisory published once the fix is released so users can update vulnerable dependencies.

There is a continuing effort to catalogue vulnerabilities in open-source software. In particular, the GitHub Advisory Database (GHSA) links CVEs and advisories to repositories, maintainers, and fixed versions.

CVEs also classify the underlying weakness with a Common Weakness Enumeration (CWE) code. CWEs are identifiers for common classes of software and hardware weaknesses: CWE-22 for path traversal, CWE-79 for XSS, CWE-835 for infinite loops that hang a process, and so on.

What makes this database useful for creating a benchmark is that maintainers increasingly link their fix directly in the ticket: a commit SHA, a pull request, sometimes both. This simple action makes life much easier for projects like mine. When the link is not available, the ground truth can still be recovered by digging into the release notes or the git history of the first fixed version.

Task curation

CVE-Bench covers a broad range of weaknesses (15 CWE categories), with CVSS scores ranging from 2.1 to 9.8, across a diverse set of real-world Python projects (18 in total, including Pillow, GitPython, yt-dlp, and urllib3).

To keep the benchmark tractable, I filtered out advisories where

the project is a monorepo (LangChain, Kubernetes, Apache projects), which means downloading hundreds of MBs and dealing with complex build/test isolation;
the security fix touches Rust, C, C++, or another compiled language alongside Python, so the agent would need a compiler toolchain and the build cycle is slow;
the committed fix introduces significant API refactoring, which would require the agent to reproduce exactly the same domain changes.

For each project, the CVE-Bench

provides the vulnerable and fixed git SHA;
delivers a setup script that initializes the vulnerable repository inside a docker container;
injects a manually curated test_security.py containing at least one test that exposes the security vulnerability but passes on the fixed code.

The agent’s goal is to repair the reported vulnerability from a task description without access to the validation script.

I manually reviewed and selected each task. Initially, I intended to use the maintainer’s own new tests as the canonical check. However, I found that many fixes ship without proper tests. Many vulnerabilities and fixes would have made great examples, but I couldn’t use them because a test suite was missing. I started writing the tests myself, but it was an effortful, cumbersome task… The simplest workaround I found was to give Claude Sonnet the advisory, the original code, and the security patch, and ask it to create standalone tests. This approach worked surprisingly well and was easy to validate: all I needed to do was check that the test suite failed on the vulnerable commit but passed on the fixed commit.
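The validation rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: `is_valid_security_test`, the `run_security_tests` callable, and the placeholder commit names are all hypothetical, and the runner is assumed to check out a commit, run test_security.py, and report whether every test passed.

```python
from typing import Callable


def is_valid_security_test(
    run_security_tests: Callable[[str], bool],
    vulnerable_sha: str,
    fixed_sha: str,
) -> bool:
    """Accept a curated test suite only if it fails on the vulnerable
    commit and passes on the fixed commit."""
    fails_on_vulnerable = not run_security_tests(vulnerable_sha)
    passes_on_fixed = run_security_tests(fixed_sha)
    return fails_on_vulnerable and passes_on_fixed


# Stand-in runner for illustration: pretends the suite passes only on the fix.
def fake_runner(sha: str) -> bool:
    return sha == "fixed-sha"


accepted = is_valid_security_test(fake_runner, "vulnerable-sha", "fixed-sha")
# A suite that passes everywhere does not expose the vulnerability.
rejected = is_valid_security_test(lambda sha: True, "vulnerable-sha", "fixed-sha")
```

A suite that passes on both commits is rejected because it never exercises the flaw; one that fails on both is rejected because it cannot recognize the fix.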

To minimize the risk of dataset contamination, most of the CVEs I selected are recent (early 2026, with 16 out of 20 released after March 2026), and the oldest dates from November 2025. However, this does not mean that the fixes are recent, since CVEs often become public after the fix has been written and released. So, it’s not impossible that models have already seen specific commits, but it’s less likely that the advisory-to-fix link is in someone’s dataset.

Designing tasks

My first thought was to create a dataset that exposes real-world advisories to agents and see whether they can fix them. That is already a very challenging task, but not an evenly distributed one. Some advisories are very clear and specific: they point to files and functions and thoroughly explain what the vulnerability is and how to trigger it. Others are less so: they provide only a short description of why the code is vulnerable.

Even though advisories are not always crystal clear, using real-world ones seemed the right benchmarking choice – that is the world developers live in, and how we communicate with each other. Plus, maintainers fixed those issues, which means the description should be enough to repair the code; it was enough for the maintainer, at least.

The advisory prompt is the GHSA advisory, stripped of any references to the fix commit or patched versions. Advisories often provide the full picture: vulnerability class, root cause, affected code paths, attack scenario, and sometimes a proof of concept. This is the richest, real-world condition.

Yet, I tend to think that a benchmark is more interesting when it tells us not only what works best, but also why it works better. That’s why I created two new sets of task instructions: diagnose and locate.

The diagnose prompt strips location entirely. The agent gets a behavioral description of the flaw – what an attacker can do and how – but no file path, no function name, no code. It has to search the codebase, form a hypothesis about where the bug lives, and fix it. This tests whether the model can reason from symptoms to cause.

The locate prompt flips that: a precise location (file and function), but no description of what’s wrong. The agent reads the code cold and has to figure out independently what is broken and how to fix it.

The three conditions test meaningfully different things, and that is precisely the point. Advisory performance is noisy by construction: some advisories are nearly prescriptive, others are thin, and the score reflects report quality as much as model capability. Diagnose is the hardest to shortcut – no anchor, no location, only a behavioral description of what an attacker can do; the model has to search, form a hypothesis, and land on the right place on its own. Locate is the condition that most closely resembles what a security researcher actually does: the model reads code it has never been told is broken and has to recognize independently that something is wrong. It is the hardest skill to fake – instruction-following and pattern-matching on the advisory text are not available as shortcuts.

A model that does well on advisory but drops on diagnose is leaning on the report, not reasoning about the vulnerability. A model that scores well on locate is recognizing dangerous code on its own. The profile across all three tells you something the aggregate solve rate never could: whether the model genuinely understands security or just follows instructions and pattern-matches.

The evaluation setup

The CVE-Bench starts a Docker container where the coding agent has access to the vulnerable project’s source code and the instruction to fix the bug (either the advisory, locate or diagnose prompts).

The agent fixes the issue by navigating the code, modifying files, and running tests with the following tools:

list_files – list files and directories in the repository
read_file – read the contents of a file
search_in_files – search for a pattern across the codebase
edit_file – apply line-level edits to an existing file
create_file – create a new file
delete_file – delete a file
run_pytest – run the project’s test suite and observe results

The execution of every tool is limited to the target repository’s source code, meaning that the agent cannot read, write or execute outside the allowed folders.
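One common way to enforce this kind of confinement is to resolve every requested path and check that it still sits under the repository root. The sketch below illustrates that technique only; the post does not describe how the harness implements the restriction, and `resolve_inside` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def resolve_inside(repo_root: Path, requested: str) -> Path:
    """Resolve `requested` relative to `repo_root`, refusing paths that escape it."""
    root = repo_root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so traversal
    # tricks are normalized before the containment check.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"{requested!r} is outside the repository")
    return candidate


# Usage with a throwaway directory standing in for the target repository.
repo = Path(tempfile.mkdtemp())
inside = resolve_inside(repo, "src/app.py")  # allowed, even if the file does not exist yet

try:
    resolve_inside(repo, "../../etc/passwd")
    escaped = True
except PermissionError:
    escaped = False
```

Absolute paths are rejected by the same check, because joining an absolute path onto the root discards the root entirely.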

I considered for some time whether to add bash tooling to give coding agents more flexibility. The issue with doing so is that agents also get more room to cheat the benchmark. As reported in Poolside’s blog, agents can mine the git history, search for reference solutions on GitHub, and even scrape the web. Fighting that can be particularly hard: it takes better steering, reward-hack judges, and continuous sample reviews. For this reason, I decided to leave this feature out, which can be – admittedly – a handicap for some models.

Once the task is completed (after at most 20 turns), CVE-Bench moves a hidden test_security.py into the repository and runs it against the agent’s implementation. If the implementation fixes the vulnerability, the tests should pass.

CVE-Bench’s primary metric is the solve rate: a run counts as solved only if all hidden security tests pass. The rationale for the binary metric is that code that is 90% patched is still vulnerable.

The benchmark also runs the project’s test suite to check for regression, rejecting the security fix if it breaks previously supported functionalities.
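Taken together, the security tests and the regression check map each run to one outcome. The sketch below is one plausible way to express that mapping, using the outcome labels that appear in the result charts. The `Outcome` enum, the `classify_run` function, and the order in which the checks are applied are assumptions, not the benchmark's actual code.

```python
from enum import Enum


class Outcome(Enum):
    SOLVED = "solved"
    REGRESSION = "regression introduced"
    SECURITY_FAILED = "security tests failed"
    NO_EDIT = "no edit attempted"


def classify_run(
    made_edit: bool,
    security_tests_pass: bool,
    regression_tests_pass: bool,
) -> Outcome:
    """Map the three observations of a run to a single outcome (assumed precedence)."""
    if not made_edit:
        return Outcome.NO_EDIT          # the vulnerable code was never touched
    if not security_tests_pass:
        return Outcome.SECURITY_FAILED  # the flaw is still exploitable
    if not regression_tests_pass:
        return Outcome.REGRESSION       # fixed the flaw but broke existing behaviour
    return Outcome.SOLVED


def solve_rate(outcomes: list[Outcome]) -> float:
    """Binary metric: only fully solved runs count."""
    return sum(o is Outcome.SOLVED for o in outcomes) / len(outcomes)


runs = [
    classify_run(True, True, True),
    classify_run(True, True, False),
    classify_run(True, False, True),
    classify_run(False, False, True),
]
rate = solve_rate(runs)
```

Only the first of the four example runs counts toward the solve rate; a security fix that breaks the existing suite is scored as a failure, which matches the binary metric described above.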

As inference cost is a recurring concern for agentic AI, CVE-Bench also provides secondary differentiation signals: the average number of tokens (in/out), the average number of tool calls, and the tool call breakdown (e.g., how much time spent reading vs. editing vs. running tests?). These signals allow practitioners to pick models that best fit their quality/budget constraints.

Two additional process signals capture how a model reaches its fix, not just whether it does: the number of read_file calls before the first edit_file, and the number of search_in_files calls before the first edit_file. A model that reads and searches extensively before touching any code is behaving differently from one that edits early and thrashes toward a passing state. Across tasks, low exploration combined with low total tool calls is the signature of a purposeful solver; low exploration combined with high total tool calls suggests brute-forcing.
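Both process signals can be derived from the ordered list of tool names in a run's trace. The sketch below assumes the trace is available as a plain list of tool names. `exploration_before_first_edit` is a hypothetical helper, and treating only edit_file as "the first edit" (so that create_file does not end the exploration phase) is an assumption taken from the wording above.

```python
from typing import Sequence


def exploration_before_first_edit(tool_calls: Sequence[str]) -> dict:
    """Count read_file and search_in_files calls that happen before the first edit_file.

    If the run never edits, every read and search is counted and `edited` is False.
    """
    reads = 0
    searches = 0
    edited = False
    for name in tool_calls:
        if name == "edit_file":
            edited = True
            break
        if name == "read_file":
            reads += 1
        elif name == "search_in_files":
            searches += 1
    return {"reads": reads, "searches": searches, "edited": edited}


trace = [
    "list_files",
    "read_file",
    "search_in_files",
    "read_file",
    "edit_file",
    "read_file",   # after the first edit: not counted
    "run_pytest",
]
signals = exploration_before_first_edit(trace)
```

Runs that never edit need a separate flag, because their counts describe the whole run rather than a pre-edit phase.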

Results

Five models – three OpenAI (gpt-5.4-mini, gpt-5.4-nano, gpt-5.5) and two Poolside (laguna-m.1, laguna-xs.2) – were evaluated across 20 CVEs and three prompt types, for 60 runs each. Three findings stand out.

Model	Total solved	Advisory	Diagnose	Locate	Avg input tokens	Avg output tokens	Avg tool calls	Reads before edit	Searches before edit
Large models
gpt-5.5	30 / 60 (50%)	12 / 20	8 / 20	10 / 20	164,553	4,687	19.3	7.7	5.3
Medium models
gpt-5.4-mini	26 / 60 (43%)	10 / 20	10 / 20	6 / 20	99,966	1,262	13.5	3.5	3.9
laguna-m.1	19 / 60 (32%)	9 / 20	4 / 20	6 / 20	352,980	4,545	19.1	7.3	2.2
Small models
gpt-5.4-nano	29 / 60 (48%)	10 / 20	11 / 20	8 / 20	128,132	1,396	14.0	3.0	3.4
laguna-xs.2	20 / 60 (33%)	8 / 20	6 / 20	6 / 20	426,895	5,408	19.6	6.5	1.6

All four cross-family pairwise comparisons reach statistical significance at α = 0.05 (McNemar test with continuity correction, n = 60 tasks per model pair): gpt-5.5 vs laguna-m.1 (p = 0.015), gpt-5.4-nano vs laguna-m.1 (p = 0.017), gpt-5.5 vs laguna-xs.2 (p = 0.028), gpt-5.4-nano vs laguna-xs.2 (p = 0.040). Within-family comparisons remain far from significance; those rankings should be read as approximate.

No model reliably fixes real vulnerabilities. The best-performing model (gpt-5.5) solves 50% of tasks overall and 60% under the most favorable condition, when the full advisory is handed directly to the agent. With a precise location but no description of the flaw (locate), every model solves fewer tasks than it does with the advisory. Both an exact one-sided sign test and the more conservative McNemar test with continuity correction agree: all four cross-family pairs cross α = 0.05 – gpt-5.5 vs laguna-m.1 (p = 0.015, 16 exclusive wins vs 5), gpt-5.4-nano vs laguna-m.1 (p = 0.017, 14 vs 4), gpt-5.5 vs laguna-xs.2 (p = 0.028, 16 vs 6), and gpt-5.4-nano vs laguna-xs.2 (p = 0.040, 15 vs 6). Within-family pairs remain far from significance. The structure of the ranking is consistent: the three OpenAI models are statistically indistinguishable from one another, the two Laguna models are indistinguishable from each other, and the confirmed separation runs between families. The task set splits into three rough clusters: 4 CVEs were solved by no model on any prompt type, 3 were solved by all five models on advisory, and 13 fall in between (which is where all the interesting variation lives).
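The quoted p-values can be recomputed from the exclusive-win counts alone. The sketch below uses only the standard library, and both function names are hypothetical. One observation from the arithmetic, which is an inference rather than something stated in the analysis: the reported values match the one-sided form of the continuity-corrected McNemar statistic, and the two-sided values would be twice as large.

```python
import math


def mcnemar_cc_one_sided(b: int, c: int) -> float:
    """One-sided p-value from McNemar's statistic with continuity correction.

    b and c are the discordant counts (tasks solved by exactly one of the two models).
    """
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    z = math.sqrt(chi2)
    # Upper tail of the standard normal; doubling it gives the usual
    # two-sided chi-square p-value with one degree of freedom.
    return 0.5 * math.erfc(z / math.sqrt(2))


def sign_test_one_sided(b: int, c: int) -> float:
    """Exact one-sided sign test: P(X >= max(b, c)) for X ~ Binomial(b + c, 0.5)."""
    n, k = b + c, max(b, c)
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


pairs = {
    "gpt-5.5 vs laguna-m.1": (16, 5),
    "gpt-5.4-nano vs laguna-m.1": (14, 4),
    "gpt-5.5 vs laguna-xs.2": (16, 6),
    "gpt-5.4-nano vs laguna-xs.2": (15, 6),
}
for name, (b, c) in pairs.items():
    print(f"{name}: McNemar={mcnemar_cc_one_sided(b, c):.3f}, sign={sign_test_one_sided(b, c):.3f}")
```

Tasks that both models solve, or that both fail, do not enter either statistic; only the discordant counts matter, which is why a 60-run comparison can rest on roughly 20 informative observations.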

Token cost varies by 4× for equivalent outcomes. The cost gap is large enough to be the primary practical differentiator. The most efficient model is gpt-5.4-mini, consuming an average of 100k input tokens per run. This is reinforced by a second behavioral split visible in the tool breakdown: mini and nano act quickly, averaging 13–14 tool calls per run, with 1–2 no-edit runs out of 60. gpt-5.5 and laguna-m.1 deliberate, averaging 19+ tool calls, and abandon without editing in 16–20 runs out of 60. laguna-xs.2 also averages 19+ turns but attempts an edit in most runs (only 6 no-edit out of 60), despite hitting the turn ceiling nearly every time. None of this extra deliberation translates into better outcomes; it produces equivalent results at higher cost.

Total tokens per run

Average total tokens per run (input + output). The Laguna models consume 3–4× more tokens than OpenAI models of equivalent capability tier, driven by longer runs. Dot colour: green = solved, orange = regression introduced, red = security tests failed.
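The ratios in the caption follow directly from the per-run averages in the results table. The sketch below redoes that arithmetic. The token and solve counts are copied from the table, while the tier pairing and the "tokens per solved task" figure are illustrative choices rather than metrics the benchmark reports.

```python
# (avg input tokens, avg output tokens) per run, copied from the results table
avg_tokens = {
    "gpt-5.5": (164_553, 4_687),
    "gpt-5.4-mini": (99_966, 1_262),
    "laguna-m.1": (352_980, 4_545),
    "gpt-5.4-nano": (128_132, 1_396),
    "laguna-xs.2": (426_895, 5_408),
}
solved = {"gpt-5.5": 30, "gpt-5.4-mini": 26, "laguna-m.1": 19, "gpt-5.4-nano": 29, "laguna-xs.2": 20}
RUNS_PER_MODEL = 60

total_per_run = {model: inp + out for model, (inp, out) in avg_tokens.items()}

# Same-tier comparisons between families
medium_ratio = total_per_run["laguna-m.1"] / total_per_run["gpt-5.4-mini"]
small_ratio = total_per_run["laguna-xs.2"] / total_per_run["gpt-5.4-nano"]

# Spread between the most and least token-hungry models
spread = max(total_per_run.values()) / min(total_per_run.values())

# Illustrative: total tokens spent across all runs, divided by the number of solves
tokens_per_solve = {
    model: total_per_run[model] * RUNS_PER_MODEL / solved[model] for model in solved
}

for model in sorted(tokens_per_solve, key=tokens_per_solve.get):
    print(f"{model}: {tokens_per_solve[model]:,.0f} tokens per solved task")
```

Dividing by solves rather than runs penalizes a model twice, once for long runs and once for failed ones, so it widens the gap between families beyond the per-run ratio.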

Tool call breakdown by model

Tool call breakdown by model (stacked, normalized). Numbers above each bar are total tool calls. mini and nano commit to editing early; gpt-5.5 and laguna-m.1 read and search extensively before acting, and often abandon without editing at all.

Regression failures are not uniformly low. gpt-5.4-nano introduces regressions in 8 runs out of 60; laguna-m.1 and laguna-xs.2 each do so in 6; mini follows at 4; gpt-5.5 stays at 2. A patch that fixes the security test while breaking existing behaviour is a distinct failure mode from not fixing it at all.

Outcome breakdown per model

Outcome breakdown per model across all 60 runs. Note the larger "no edit attempted" share for gpt-5.5 and laguna-m.1, and the elevated regression (orange) bars for nano, laguna-m.1, and laguna-xs.2 relative to mini and gpt-5.5.

Detailed investigation

Reading aggregate solve rates alone flattens what are qualitatively very different failures. A model that never touches the code and one that confidently patches the wrong part of the vulnerability count the same in the leaderboard. The traces tell us a much richer story. I spent some time digging into what these models are doing, exploring the failure modes, the zero-solve CVEs, and a few individual cases that are worth mentioning more precisely.

Reading the traces for failed runs reveals four recurring patterns, each pointing to a different underlying gap.

Wrong-search drift. On CVE-2026-33175 (Auth0 unverified email bypass), gpt-5.5 opened the right file on Turn 3. Rather than making the straightforward addition – a two-line email verification check directly in auth0.py – the model concluded the authentication flow must be handled by the base class and immediately pivoted to reading oauth2.py, covering it in four separate reads across Turns 4–8 (roughly 1,400 lines in total). It then browsed the test suite and read the unrelated google.py provider file. On Turn 14, it finally searched for email_verified and received “no matches found”, since adding that field is the fix. Six more turns of searching followed; the budget expired without a single edit. The same drift pattern appears in CVE-2026-26331 (yt-dlp netrc injection): the model found and read the vulnerable function on Turns 2–3, then spent the remaining 17 turns drifting through postprocessors, test data, and unrelated extractor files before the budget expired. A single incorrect inference from an early read was enough to abandon a correct plan before it was ever executed.

Budget exhaustion mid-implementation. On CVE-2026-42561 (python-multipart header DoS), gpt-5.5 read the parser state machine carefully across multiple turns and correctly identified that the fix required enforcing header count and size limits. On the last available turn (Turn 20, diagnose condition), it added three config-class annotations – MAX_HEADER_COUNT, MAX_HEADER_SIZE, and MAX_HEADER_VALUE_SIZE – to FormParserConfig. It never wired them into the parser: no __init__ parameter changes, no state machine enforcement, no MultipartParseError raises. The understanding seems to be complete; the budget ran out between scaffolding and implementation. The same pattern – correct diagnosis, incomplete fix – appears in CVE-2026-44431 (urllib3 proxy SSRF), where gpt-5.5 re-read connectionpool.py four times in the advisory run and three times in diagnose, hitting the turn ceiling both times without committing to an edit.

Partial fix. A recurring pattern across CVEs is a model that makes real, coherent edits to the right code, runs its own tests, sees them pass, and stops — while the hidden security tests cover vectors the model did not implement. The fix is correct in spirit but incomplete in coverage, and the model has no signal to push further. This is a direct consequence of the agent not having access to the security tests: visible tests all pass, so there is no feedback that anything remains broken.

Correct file, wrong part of the vulnerability. On CVE-2026-40864 (JupyterHub XSRF bypass), gpt-5.4-mini found the right file in its diagnose run, made a coherent edit, passed every regression test, and still failed the security test. The model correctly identified an overly broad exemption in the XSRF logic and tightened it, but fixed the wrong exemption – removing the navigate/unspecified branch while leaving no-cors exempt, which is the actual vulnerable path. No regression test covered it, so the model had no signal that its patch was incomplete. This is the most operationally dangerous failure mode: a plausible, test-passing fix with no visible indication anything is wrong.

Exploration depth before first edit by model and outcome

Average reads and searches before the first edit, split by outcome. Models that eventually solved tend to explore less before committing; failed runs show more pre-edit exploration, consistent with the drift and budget exhaustion patterns described above.

Four CVEs were not solved by any model across all 15 runs (5 models × 3 prompt types): CVE-2026-26331, CVE-2026-44431, CVE-2026-44432, and GHSA-r758-8hxw-4845. These are not benchmark defects: the models made real edits in most cases, but no edit was ever sufficient to pass the security tests. All four had sparser editing than the rest of the task set, with several individual runs making no changes at all. In no case did the test infrastructure fail to detect a correct fix; the fixes were simply never produced.

Solve rate by model and prompt type

Solve rate per model broken down by prompt type (advisory, diagnose, locate). gpt-5.5 and gpt-5.4-nano drop least on locate; the Laguna models drop more on aggregate but outperform OpenAI models on specific tasks.

The prompt-type breakdown reveals one signal worth noting: gpt-5.5 holds up best on locate (12/20 on advisory to 10/20 on locate), closely followed by gpt-5.4-nano (10/20 to 8/20). laguna-xs.2 also loses only two solves, but from a lower base (8/20 to 6/20), while laguna-m.1 and gpt-5.4-mini drop by three and four. These differences are within noise for every model individually, so this is a trend to watch as the task set scales, not a confirmed finding.

One result cuts against the aggregate ranking: on CVE-2026-30930 (Glances TimescaleDB SQL injection), both Laguna models pass locate while no OpenAI model does. The traces show why. On the locate condition, the agent receives only the file and function name, with no description of the flaw. Both laguna-m.1 and laguna-xs.2 read the file on Turn 1 and had a diagnosis by Turn 2: “clear SQL injection vulnerability.” They then spent several more turns confirming the approach – checking related export modules and psycopg adapter patterns – before committing to edits on Turns 10 and 12 respectively. gpt-5.5 also read the file first and correctly identified normalize() as the target, then spent the next 17 turns searching psycopg imports, conftest files, and nonexistent test paths before making any edit – hitting the ceiling mid-implementation. The Laguna models diagnosed early and executed; gpt-5.5 kept searching for external confirmation until the budget ran out. Aggregate rankings run one way; on tasks that reward decisiveness over thoroughness, the dynamic can reverse.

One behavioral pattern distinctive to the Laguna family is worth recording. Both laguna-m.1 (13/60 runs) and laguna-xs.2 (9/60 runs) call a shell tool to execute validation code directly – tool invocations like running the patched module against a crafted input, or inspecting internal state mid-fix. The tool does not exist in the harness; every call errors immediately. The model retries across multiple turns regardless, sometimes spending several consecutive turns on failed shell calls before abandoning the attempt. No OpenAI model does this. Whether it reflects a reasoning habit or simply a trained reflex is unclear, but it is consistent enough to treat as a signal rather than noise, and it points to models trained for richer toolsets than CVE-Bench provides. That is not a flaw in the models; it is a mismatch between their expectations and the sandbox. For practitioners, it is a reminder that tool availability assumptions are baked into model behavior in ways that aggregate benchmarks do not surface.

Limitations

Contamination is an open problem. All CVEs in the task set are from late 2025 and early 2026, after the training cutoffs of all evaluated models. That reduces but does not eliminate exposure risk: CVEs become public only after the fix is merged and released, so the patch commit may predate the CVE disclosure by months or years. It is not impossible that a model has seen a specific fix. What is less likely is that the full chain (advisory text, vulnerable code, and fix) appears together in training data in a form that would directly short-circuit the task. I’m not aware of any principled way to verify this without access to training corpora.

The task set is narrow by design, and that is a limitation. Twenty CVEs, all in Python, all fixes localized to one or a small number of files within a single project. The curation filters exclude monorepos, fixes that touch compiled languages alongside Python, and fixes that require significant API refactoring. As a side effect, this skews the set toward vulnerabilities with compact, self-contained patches. The CWE distribution reflects that: roughly half the tasks are injection-class issues (path traversal, SQL injection, command injection), with the remainder spread across DoS, authentication bypass, deserialization, and XSS. More complex vulnerability classes, such as those requiring protocol-level changes, coordinated multi-service fixes, or schema migrations, are not represented. The statistical power is correspondingly limited: with 60 runs per model, within-family comparisons remain underpowered, and those rankings should be read as approximate.

Pain points

Building this dataset was anything but trivial. First, I had to dig into software security, something I mostly avoided in my career since I worked mainly on data pipelines and research engineering.

Right from the beginning, I was shocked by how lax some maintainers can be. It’s quite common for devs to ship fixes without any tests at all. In some cases, I could spot that the fix wasn’t sufficient. In others, the developer fixed the reported vulnerability and introduced another. Honestly, I should have reported these, but I didn’t. That’s on me.

Setting up the environments was another painful experience. Some repositories don’t have many regression tests, while others have thousands of them. Some repositories have dependencies on databases, while others on networking. Some have lots of external dependencies, while others rely on system libraries. It’s not easy making it uniform enough to benchmark agents. Gathering the tasks and rebuilding each ecosystem in a reproducible way took me much more time than I initially thought it would take.

And there was also the inference cost. In total, I put nearly $100 into this experiment, nearly 5x the budget I initially planned. My original idea was to compare more models on a larger dataset. I quickly saw the billing climb faster than my wife had authorized… In particular, the bill exploded with the Anthropic models. They’re so expensive that I had to cut them out of scope. Poolside, on the other hand, offered free model access during the period of this work, which made it possible to include their models in the evaluation.

So, what now?

The headline result is that no model reliably outperforms any other within its family. The best solve rate, 50% overall and 60% on advisory, means frontier models still fail more often than they succeed on real security vulnerabilities. The OpenAI models are statistically indistinguishable from one another, and so are the two Laguna models. The cross-family separation, however, is now confirmed: all four OpenAI-vs-Poolside pairs cross α = 0.05 under McNemar with continuity correction, and the pattern is consistent across both OpenAI model sizes. A power analysis puts the within-family situation in sharper relief: detecting a 60% win-rate advantage in discordant tasks – a meaningful but modest edge – would require roughly 700 tasks. Even an 80% win-rate advantage needs at most 75. The practical implication is direct: the performance gap between gpt-5.5 and gpt-5.4-mini, even if real, is too small to justify a 25× token cost increase. I set out hoping to find a clear winner. The data instead draws a cost-efficiency conclusion: at current capability levels, the cheaper OpenAI models are the rational choice.
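The shape of that power analysis can be sketched with the standard normal-approximation sample size for a sign test on discordant pairs. The post does not state the power target, the sidedness, or the within-family discordance rate behind the 700 and 75 figures, so the defaults below (one-sided α = 0.05, 80% power) and the `discordance_rate` argument are assumptions, and the sketch does not attempt to reproduce those exact numbers.

```python
import math
from statistics import NormalDist


def required_discordant_pairs(win_rate: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Discordant pairs needed to detect `win_rate` against 0.5 with a one-sided sign test.

    Normal approximation: n = ((z_a * 0.5 + z_b * sqrt(p * (1 - p))) / (p - 0.5)) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * 0.5 + z_beta * math.sqrt(win_rate * (1 - win_rate))
    return math.ceil((numerator / (win_rate - 0.5)) ** 2)


def required_tasks(win_rate: float, discordance_rate: float, **kwargs) -> int:
    """Scale up by the share of tasks on which the two models disagree (assumed input)."""
    return math.ceil(required_discordant_pairs(win_rate, **kwargs) / discordance_rate)


modest_edge = required_discordant_pairs(0.60)
large_edge = required_discordant_pairs(0.80)
# Hypothetical discordance rate of 25%, chosen only to show the scaling step.
tasks_for_modest_edge = required_tasks(0.60, discordance_rate=0.25)
```

The required sample grows with the inverse square of the edge over 50%, which is why a modest advantage needs roughly ten times as many tasks as a large one.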

What did hold up is more interesting than a ranking. The failure modes are structured and repeatable across models, tasks, and prompt types. Wrong-search drift, budget exhaustion mid-implementation, plausible-but-incomplete patches that pass every visible test. These are not random noise. They are specific capability gaps that show up consistently enough to be actionable. A practitioner deploying agents for security patching will hit all of them. Knowing which failure modes dominate for a given model and task class is often more useful than a leaderboard position.

The locate condition is the benchmark’s sharpest tool. Strip the advisory and give the model only a file and function name: no description of the flaw, no attack scenario, just code to read cold. Every model drops. gpt-5.5 and gpt-5.4-nano hold up best, losing two solves each and finishing at 10/20 and 8/20; laguna-xs.2 also loses only two, but finishes at 6/20. That relative resilience is the closest thing to a genuine signal in the data: a hint that locate performance, as the task set scales, may be where models actually differentiate. Advisory performance is noisy by construction, inflated by report quality and instruction-following. Locate is where genuine security reasoning would show up, and it mostly doesn’t… Yet.

The locate condition points to what would actually constitute progress: a model that reads unfamiliar code cold and recognizes independently that something is wrong. No publicly available frontier model does this reliably yet. That’s a bar worth keeping.

The benchmark, task files, and result data are all open. See the repository. Contributions and task submissions are welcome.

Citation

@misc{gattipinheiro2026cvebench,
  author       = {Gatti Pinheiro, Giovanni},
  title        = {{CVE-Bench}: Benchmarking {LLM} Agents on Real-World Security Vulnerability Fixes},
  year         = {2026},
  howpublished = {\url{https://giovannigatti.github.io/cve-bench}},
  note         = {Code available at \url{https://github.com/GiovanniGatti/cve-bench}}
}
