A bug is a bug. A “potential vulnerability” is a bug. A vulnerability is verifiable as having security implications with a proof of concept or other substantial evidence.
Words matter. Bugs matter. It's important to fix large numbers of bugs, just as it always has been, and as has always been done. Let that be impressive on its own, because it IS impressive.
Mythos didn't write 271 PoCs for vulnerabilities and demonstrate code-path reachability with security implications. Mythos found 271 valid bugs. Let that be enough.
> As additional context, we apply security severity ratings from critical to low to indicate the urgency of a bug:
> * sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior, like browsing to a web page. We make no technical difference between these, but sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> * sec-moderate is assigned to vulnerabilities that would otherwise be rated sec-high but require unusual and complex steps from the victim.
> * sec-low is assigned to bugs that are annoying but far from causing user harm (e.g, a safe crash).
> Of the 271 bugs we announced for Firefox 150: 180 were sec-high, 80 were sec-moderate, and 11 were sec-low.
Mozilla uses the term "vulnerability" for even sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit. And on their definitional page, they classify even sec-low as "vulnerabilities" [2].
Words are tools that get their utility from collective meaning. I'd be interested to know where you received your semantics from, and whether they match up with or diverge from Mozilla's.
[1] https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
[2] https://wiki.mozilla.org/Security_Severity_Ratings/Client
From what I understand, that is a recipe for getting quickly banned by commercial LLM providers?
Wired: Mozilla Used Anthropic's Mythos to Find and Fix 271 Bugs in Firefox (41 points, 18 comments) https://news.ycombinator.com/item?id=47853649
Ars: Mozilla: Anthropic's Mythos found 271 security vulnerabilities in Firefox 150 (33 points, 8 comments) https://news.ycombinator.com/item?id=47855384
Firefox is written in several languages; only about 25% of it is in C++, but every single one of these issues seems to touch the C++.
I don't understand much of this paragraph:
* "a crank they can pull that says: ‘Yep, this has the problem,’": as in, ring an alarm? Does the LLM ring the alarm?
* "you can iterate on the code and know clearly when you’ve fixed it": Isn't that true of most bugs, assuming you do the normal thing and generate a test case? And I thought the LLM output test cases itself: "It will craft test cases. We have our existing fuzzing systems and tools to be able to run those tests" And are they claiming the LLM facilitates iterating?
* "and eventually land the test case in the tree": Don't you create the test case before the fix? And just a few words earlier they seemed to be working on the fix, not the test case. And see the prior point about test cases.
* "such that you don’t regress it.”: How is the LLM helping here?
Maybe I'm missing some fundamental unwritten assumption?
This isn’t sarcasm. Firefox deserves to be used more. Most people I know don’t use it because “Chrome does almost everything better”, and Firefox can’t compete with the other browsers’ roadmaps.
It's better because it actually lists a sample of Bugzilla reports that were made public. This topic was discussed previously (36 comments two weeks ago: https://news.ycombinator.com/item?id=47885042), but the part about bug reports being made public is brand new.
The zero-days are numbered
I think the word you're looking for is exploit?
I wonder if these models will get good + cheap enough so that people rarely reach for static analysis.
I eventually left and wound up at Mozilla where there were a number of /* flawfinder ignore */ comments scattered throughout the code.
My guess is that Mythos just ignored the "flawfinder ignore" directives and reported the known vulnerabilities in the code.
New tools find new bugs, but the oligarchy newspapers report on Mythos and not on clang-22.0.
I'm not only talking about big things like
* Pocket,
* several major UI redesigns and
* the offline translations,
but even tiny useless things like
* browser.urlbar.trimURLs,
* putting the search query in the URL bar instead of the URL after searching from the URL bar,
* messing with the Edit and Resend feature for no reason (the good one that updates the content length is still available at devtools.netmonitor.features.newEditAndResend) and
* probably thousands of little shit like this that took a bunch of developer hours to implement.
All of the above should've been add-ons.
And of course, we know Mozilla spends a lot of money on things unrelated to Firefox at all. It's amazing Firefox is somewhat secure and stable compared to Chrome, which is backed by Google with their infinitely deep pockets.
In general, I would say that our use of "vulnerability" lines up with what jerrythegerbil calls "potential vulnerability". (In cases with a POC, we would likely use the word "exploit".) Our goal is to keep Firefox secure. Once it's clear that a particular bug might be exploitable, it's usually not worth a lot of engineering effort to investigate further; we just fix it. We spend a little while eyeballing things for the purpose of sorting into sec-high, sec-moderate, etc, and to help triage incoming bugs, but if there's any real question, we assume the worst and move on.
So were all 271 bugs exploitable? Absolutely not. But they were all security bugs according to the normal standards that we've been applying for years.
(Partial exception: there were some bugs that might normally have been opened up, but were kept hidden because Mythos wasn't public information yet. But those bugs would have been marked sec-other, and not included in the count.)
So if you think we're guilty of inflating the number of "real" vulnerabilities found by Mythos, bear in mind that we've also been consistently inflating the baseline. The spike in the Firefox Security Fixes by Month graph is very, very real: https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
For us, this is substantial enough evidence to consider it a security vulnerability at that point, unless shown otherwise; it has always been this way (also for fuzzing bugs).
My only source for this is personal experience, and no, I can't share any evidence of it.
Security things are mentioned in the Release Notes [b] pointing to a completely different document [d].
Perhaps sometimes a bug is 'just' a bug, and not a vulnerability.
[a] https://bugzilla.mozilla.org/show_bug.cgi?id=2034980 ; "Can't highlight image scans in Firefox 150+"
[b] https://www.firefox.com/en-CA/firefox/150.0.2/releasenotes/
[c] https://bugzilla.mozilla.org/show_bug.cgi?id=2024918
[d] https://www.mozilla.org/en-US/security/advisories/mfsa2026-4...
https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
> eventually land the test case
This is just a reference to the fact that we don't land test cases for security bugs immediately in the public repository, to make it harder for attackers. You are right that the LLM only helps with creating the initial test case. Things like running the test case in automation is part of the standard development process.
That's not evident in what you pasted at all.
What you pasted says
> sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior […] We make no technical difference between these […] sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
> sec-low is assigned to bugs that are annoying but far from causing user harm (e.g, a safe crash).
From this one infers that the 180 sec-high bugs found are actually exploitable (though not known to have been exploited in the wild), and are NOT mere annoying bugs.
The difference between 180 and 271 does nothing to deflate the significance, or lack thereof, of the implication re: Mythos.
From what I can tell, a lot of these bugs were hardly C++-specific, they just happened in C++ code. Even the most secure Rust can't magically catch things like TOCTOU issues.
Totally agree. I even go as far as choosing which website I make purchases on depending if they work on FF, or writing to support occasionally to tell them it's not supported or a feature isn't working properly and this would be appreciated.
I know it pretty much always goes nowhere, but I feel it's what I can do to keep the browser somehow on the radar.
Well, depending on how you define "exploit"; some might only achieve an arbitrary pointer read, or just an out-of-bounds read. Those would be useful primitives in a chain of vulnerabilities, not exploits themselves. You'll have to read through the first comments yourself, but if you're hoping that this is all nonsense and ignorable hype, you're going to be disappointed.
It is still unclear and open for speculation as to what percentage of all security bugs in Firefox today are being found by the AIs (as opposed to not being found at all). It might be that AI is very good at certain types of problems, even if we can't put our finger on what those types are, and that after the initial wave of bug reports the AI findings will slow to a trickle even while many many other bugs remain in the codebase. Or it might be that AI really does detect most instances of some class of problems and all those bugs will now be gone forever, never to return as long as Mozilla keeps paying the token monster. This is closely related to the oft-asked question "are we better or worse off after both attackers and defenders have access to this new capability?"
That's a really good use case for LLMs. It also applies to things like finding proofs in Lean and creating test stimulus. In both cases you know automatically whether the output is good, and it doesn't really matter if it isn't.
That isn't the case for most bugs, and definitely isn't the case for actually fixing bugs.
More tools in more hands means more software being built across a wider range of projects.
That alone will make software safer.
But it also represents more easily available opportunities for blackhats to abuse against the projects where these tools were not being applied.
Translating things to Rust manually was already a thing before LLMs came into the picture. Now with LLMs that's only going to get easier and faster. The long term value is going to come from getting on top of the mountain of technical debt in the form of existing C/C++ code bases that is responsible for the vast majority of memory exploits, buffer overflows, and other issues that despite decades of attention still are being found across major code bases on a regular basis.
Mozilla finding these issues comes on the back of a quarter century of some very competent engineers trying to do the right thing and using all the tools at their disposal to prevent these issues from happening. I have a lot of respect for that team and the contributions it has made over the years to improve tools, testing/verification practices, etc. The issue is not their effort or competence.
The job of taking an existing system that is well covered in test, well documented/specified, etc. and producing a new one that can function as a drop in replacement is now something that can be considered. A few years ago that would have translated into absolutely massive project cost and risk. Now it's something you can kick off on a Friday afternoon. Worst case it doesn't work, best case you end up with a much better implementation.
It's still early days. There are still a lot of quality issues with LLM generated code. But the success/fail rate will probably improve over time.
Firefox/Mozilla tries literally anything to expand their feature set, customer base, or revenue stream? They need to stop spending money on that and instead spend money on the free product of theirs that I care about, in exactly the way I want.
Google surveilles the entire world, spends huge amounts on lobbying, degrades their own websites on other browsers? Not a peep, usually.
For my part, I pay mozilla for their VPN service, which I’m sure many here would decry as useless spending that should be going to firefox instead.
I'm genuinely curious what "types" of implementation mistakes these were, like whether e.g. it was library usage bugs, state management bugs, control flow bugs etc.
Would love to see a writeup about these findings, maybe Mythos hinted us towards that better fuzzing tools are needed?
But report [1] says that "Some of these bugs showed evidence of memory corruption...", which implies that the majority of these (which include the 271 bugs from Mythos) don't have such evidence at all. Am I misunderstanding something?
> For us this is substantial enough evidence to consider it a security vulnerability at that point
Mythos is supposed to be pretty good at writing actual exploits, so (as I understand it) there shouldn't be any serious problem with checking whether a bug is a vulnerability or not.
[1] https://www.mozilla.org/en-US/security/advisories/mfsa2026-3...
Same with my wife: after I explained things to her and she understood how different the internet experience can be, it's now her primary browser.
So please don't frame the argument as 'here is a crappy underdog, but please use it because monopoly is bad and Google is a bit evil'. It's a first-class experience in everything I have ever thrown at it. Triple that on mobile; it's by far the best and most useful mobile experience, bar none.
Part of the problem is, when they stop working on fixing bugs, they start doing Mr Robot things... We just want a web browser. Nobody asked for pocket, or AI...
If they use AI to fix all the bugs, then what else is for them to do, other than maintain syntax compatibility with the various languages they build with? They're just going to go back to making the browser trash again.
For instance, in one of the included bugs (2022034) it figured out that a floating point value being sent over IPC could be modified by an attacker in such a way that it would be interpreted by the JS engine as an arbitrary pointer, due to the way the JS engine uses a clever representation of values called NaN-boxing. This is not beyond the realm of a human researcher to find, but it did nicely combine different domains of security.
As the person responsible for accidentally introducing that security problem (and then fixing it after the Mythos report), while I am aware of NaN-boxing (despite not being a JS engine expert), I was focused more on the other more complex parts of this IPC deserialization code so I hadn't really thought about the potential problems in this context. It is just a floating point value, what could go wrong?
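For readers unfamiliar with NaN-boxing, here is a minimal sketch of the hazard (illustrative Python, not Firefox code; the bit pattern and the `sanitize` helper are made up for the example). Many 64-bit patterns decode as a NaN double, and a NaN-boxing engine interprets the NaN payload bits as a tagged pointer, so a raw double crossing a trust boundary can smuggle an attacker-chosen "pointer" unless it is canonicalized:

```python
import math
import struct

def bits_to_double(b):
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def double_to_bits(d):
    return struct.unpack("<Q", struct.pack("<d", d))[0]

# Any 64-bit value with an all-ones exponent and nonzero mantissa decodes
# as NaN, but the payload bits survive and could be misread as a pointer
# by a NaN-boxing value representation.
crafted = 0xFFF8_0000_DEAD_BEEF  # NaN bit pattern carrying attacker payload
assert math.isnan(bits_to_double(crafted))

# Defense sketch: canonicalize NaNs at the trust boundary so only one
# well-known NaN bit pattern survives deserialization.
def sanitize(d):
    return float("nan") if math.isnan(d) else d

canonical = double_to_bits(sanitize(bits_to_double(crafted)))
assert canonical != crafted                    # payload bits stripped
assert math.isnan(bits_to_double(canonical))   # still a plain NaN
```

The actual fix in a real engine happens at the IPC deserialization layer, but the shape of the problem and the canonicalization idea are the same.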
see https://www.blackduck.com/static-analysis-tools-sast/coverit...
and for Firefox-related alleged defects, see https://scan.coverity.com/projects/firefox
You have to create an account to view the actual reported defects.
There are just over 5000 reported defects still outstanding. I don't know how many overlap with the reported 271 Mythos-reported defects.
Sure, but surely AddressSanitizer would also detect the same problem in the C or Rust, which together also make up about 25% of Firefox, so...?
If Firefox starts acting maturely, we can stop having these conversations. Until then we see useless crap in every update while most bugs don't get any meaningful attention. Some changes even made things worse than they were before, for example the new Edit and Resend (not so "new" anymore). If Mozilla starts acting in the best interest of the user, stops with the ad BS, doesn't try to be everything all at once, and actually focuses on Firefox, I would donate. And so would others. If I donate now, I doubt even 1% of my money would go to anything meaningful, like bug fixing.
> Not a peep, usually.
No, fuck Google and Chrome and even anything Chromium-based. Here's the peep from me.
> For my part, I pay mozilla for their VPN service, which I’m sure many here would decry as useless spending that should be going to firefox instead.
Does the profit from the VPN service go to Firefox? If not, what's the point of having a VPN service?
That being said, I think there's a lot of potential for synergy here: if LLMs make writing code easier, that includes fuzzers, so maybe fuzzers will also end up finding a lot more bugs. I saw somebody on Twitter say they used an LLM to write a fuzzer for Chrome and found a number of security bugs that they reported.
In this particular sense, AI tends to find bugs that are closer to what we'd see from a human researcher reading the code. Fuzz bugs are often more "here's a seemingly innocuous sequence of statements that randomly happen to collide three corner cases in an unexpected way".
Outside of SpiderMonkey, my understanding is that many of the best vulnerabilities were in code that is difficult to fuzz effectively for whatever reason.
Much as GitHub calls everything an "issue" and GitLab a "work item".
(Don't worry- I use the system browser for any site I don't fully trust.)
I'm guessing a bit, but for example: out of bounds reads are not memory corruption. Assertion failures in debug builds are also usually not memory corruption, and I'd guess that many of these bugs were found through assertions. (Some parts of Firefox like the SpiderMonkey JS engine make heavy use of assertions, and that's the biggest signal used for defect validation. An assertion firing is almost always treated as a real and serious problem. Though with our harness, Opus and Mythos try to come up with an exploit PoC anyway.)
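To illustrate why an assertion firing is such a useful signal (hypothetical code, nothing here is from Firefox): the assert reports the violated invariant cleanly, before any out-of-bounds write actually corrupts memory.

```python
# Hypothetical invariant check: the assertion fires on the bad index
# before any out-of-bounds write can happen, giving a clean, debuggable
# signal rather than silent memory corruption.
def set_index(buf, i, v):
    assert 0 <= i < len(buf), "invariant violated: index out of bounds"
    buf[i] = v

buf = [0, 0, 0]
set_index(buf, 1, 7)        # in bounds: fine
assert buf == [0, 7, 0]

caught = False
try:
    set_index(buf, 5, 9)    # out of bounds: stopped by the assertion
except AssertionError:
    caught = True
assert caught and buf == [0, 7, 0]   # nothing was corrupted
```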
At Mozilla, but not everywhere: exploits are a subset of vulnerabilities are a subset of bugs.
If you look closely at, say, this patch, you might get a sense of what I mean (although the real cleverness is in the testcase, which we have not made public): https://hg-edge.mozilla.org/integration/autoland/rev/c29515d...
You get bug bounties if you report the kind of bugs Mythos identified. There's a reason no-one collected bounties from the "5000 defects" Coverity identified.
The Mythos reports have several examples of chaining a whole bunch of logic in different parts of the program together to exploit something very subtle. The Coverity reports aren't anything like that. These tools aren't remotely in the same league or even universe.
Using LLM coding tools to stay on top of static analysis tool output works very well and adding some guard rails that enforce that there are no issues is probably a good idea. Just like adding CI checks to make sure everything is clean.
As for false positives, it depends on the tool. I tend to avoid tools that generate mostly noise. Most of these tools allow you to disable rules if they produce a lot of noise. Or you can just tell the LLM to fix all the issues. When it's cheaper to fix things than to argue with the rule, just fix it. That used to be really expensive when you had to do that manually. Now it isn't.
I recently did this to an Ansible code base that I needed to refresh after not touching it for a few years. It had hundreds of ansible-lint issues; mostly deprecation warnings and some non fatal other warnings. And 10 minutes later I had zero. Mostly they probably weren't very serious ones but it's a form of technical debt. If you have to fix hundreds of warnings manually, you are probably not going to do it. But if you can wave a magical wand and it all goes a way, why not? I adjusted the guard rails so it now always runs ansible-lint and fixes any issues. Only takes a few seconds extra.
I maintain a static analysis tool using in Firefox's CI. False positives have to be fixed or annotated as non-problems in order for you to land a patch in our tree. That means permitting zero positives (false or true), which is a strict threshold. This is a conscious tradeoff; it requires weakening the analysis and getting some false negatives (missed bugs) in order to keep the signal-to-noise ratio high enough that people don't just ignore it and annotate everything away, or stop running it. Nearly all static analysis tools have to do this balancing act.
AI, as commonly used, is given more leeway. It's kind of fundamental that it must be allowed to hallucinate false positives; that's the source of much of its power. Which means you need layers of verification and validation on top of it. It can be slow, you'll never be able to say "it catches 100% of the errors of this particular form: ...", and yet it catches so much stuff.
Data point: my analysis didn't cover one case that I erroneously thought was unlikely to produce true positives (real bugs), and was more complex to implement than seemed worth the trouble. Opus or Mythos, I'm not sure which, started reporting vulnerabilities stemming from that case, so I scrambled and extended the analysis to cover the gap. It took me long enough to implement that by the time I had a full scan of the source tree, Claude had found every important problem that it reported. The static analysis found several others, and I still honestly don't know whether any of them could ever be triggered in practice.
I still think there's value in the static analysis. Some of those occurrences of the problematic pattern might be reachable now through paths too tricky for the AI to construct. Some of them might turn into real problems when other code changes. It seems worth having fixes for all of them now for both possibilities, and also for the lesser reason of not wanting the AI to waste time trying to exploit them. At the same time, clearly the cost/benefit balance has shifted.
They could also team up: if I relax my standards and allow my analysis to write an additional warnings report of suspected problems, with the clear expectation that they might be false alarms, then I could feed that list to an AI to validate them. Essentially, feed slop to the slop machine and have it nondeterministically filter out the diamonds in the rough.
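The teaming idea above can be sketched as a two-stage pipeline (all names and patterns here are hypothetical stand-ins; the second stage is a stub where an AI triage step would go): a deliberately over-approximate scan produces noisy warnings, and a validator keeps only the ones it can substantiate.

```python
# Stage 1: relaxed, over-approximate scan that flags every line matching
# a risky-looking pattern, accepting false alarms by design.
def relaxed_scan(source_lines):
    return [i for i, line in enumerate(source_lines) if "unchecked" in line]

# Stage 2: validator (a stub standing in for an AI triage step) that
# discards warnings already justified as safe.
def validate(source_lines, warning_line):
    return "// safe:" not in source_lines[warning_line]

src = [
    "unchecked_copy(dst, src); // safe: bounds proven above",
    "unchecked_copy(dst, user_input);",
]
suspects = relaxed_scan(src)
confirmed = [i for i in suspects if validate(src, i)]
assert suspects == [0, 1]   # the relaxed scan is noisy on purpose
assert confirmed == [1]     # the filter keeps only the real suspect
```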
Food for thought...
https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
I suppose it depends what the word "magically" means. A ToCToU race is because you imagined things wouldn't change but they did and in Rust you actually do write fewer patterns with this mistake because of the Mutable xor Aliased rule. If we have at least one immutable reference to a Goose then Rust isn't OK with anybody mutating the Goose, your safe Rust can't do that and unsafe Rust mustn't do that. So the ToCToU race caused by "Oops I forgot somebody else might change the Goose" is less likely because you were made to wrestle with this problem during design - the safe Rust where you just forgot about this doesn't compile.
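The check-then-use shape of a TOCTOU bug is easy to sketch in any language. A minimal illustration (Python, with hypothetical names; the callback stands in for "somebody else might change the Goose"):

```python
# Buggy: the length check and the indexing are separated by a callback
# that can mutate the shared buffer, so the check can go stale.
def read_last(buf, on_access):
    if len(buf) == 0:          # time of check
        return None
    on_access()                # attacker-influenced code runs here...
    return buf[len(buf) - 1]   # ...time of use: the check may be stale

shared = [42]
crashed = False
try:
    read_last(shared, on_access=shared.clear)  # mutates between check and use
except IndexError:
    crashed = True
assert crashed

# Fixed: snapshot the value while the check still holds, so check and
# use are a single atomic step.
def read_last_safe(buf, on_access):
    last = buf[-1] if buf else None
    on_access()
    return last

shared = [42]
assert read_last_safe(shared, on_access=shared.clear) == 42
```

Rust's Mutable xor Aliased rule forces you to confront exactly this interleaving at design time: code where another party could mutate the buffer while you hold a reference to it simply doesn't compile in safe Rust.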
So while Mythos certainly is real I think you could do the same with Deepseek pro, GPT 5.5 etc...
Think it's more a case of Mythos raising widespread awareness that tireless LLMs can be weaponized to dig through code and find that one tiny flaw nobody spotted.
The code before the patch does not look obviously wrong. Now some more lines have been added, but would you say it now looks less obviously wrong, or more obviously correct?
It seems that the invariants needed here are either in some people's heads, or in some document that is not referenced.
Reading the code for the first time, the immediate question is: "What other lines might be missing? How can I figure?"
If the "obviously correct" level of the code does not increase for a human reviewer, how is it ensured that a similar problem will not arise in the future? Or do we need more LLM to tell us which other lines need to be added?
What is the point of keeping it private? I'd bet feeding this patch to Opus and asking to look for specific TOCTOU issue fixed by the patch will make it come up with a testcase sooner or later.
Of course, even the reports with flawed methodology could be suggesting that a great harness + weak model might achieve a similar level of results as a mediocre harness + strong model. But I'd want to see solid evidence for that.
This new post makes it pretty clear that this was all bolted on top of their existing fuzzing infrastructure, and really just used to get more and better initial hits for a very skilled team to look at. I assume Anthropic was giving them a very good deal on inference for the positive PR, but I believe these other reports and suspect Mozilla did not really need them.
Two weeks ago we announced that we had identified and fixed an unprecedented number of latent security bugs in Firefox with the help of Claude Mythos Preview and other AI models. In this post, we’ll go into more detail about how we approached this work, what we found, and advice for other projects on making good use of emerging capabilities to harden themselves against attack.
Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it’s cheap and easy to prompt an LLM to find a “problem” in code, but slow and expensive to respond to it.
It is difficult to overstate how much this dynamic changed for us over a few short months. This was due to a combination of two main factors. First, the models got a lot more capable. Second, we dramatically improved our techniques for harnessing these models — steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise.
Ordinarily we keep detailed bug reports private for several months after shipping fixes and issuing security advisories, largely as a precaution to protect any users who, for whatever reason, were slow to update to the latest version of Firefox. Given the extraordinary level of interest in this topic and the urgency of action needed throughout the software ecosystem, we’ve made the calculated decision to unhide a small sample of the reports behind the fixes we recently shipped. We’ve attempted to draw them from a range of browser subsystems, but the selection process was still somewhat arbitrary. Nevertheless, we hope that the depth and diversity of these reports lends credence to our assessment of the capabilities and our calls for defenders to begin applying these techniques:
| Bug ID | Description |
|---|---|
| 2024918 | An incorrect equality check can cause the JIT to optimize away the initialization of a live WebAssembly GC struct, creating a fake-object primitive with potential arbitrary read/write in code that had undergone extensive fuzzing by internal and external researchers. |
| 2024437 | A 15-year-old bug in the |
| 2021894 | Reliably exploits a race condition over IPC, allowing a compromised content process to manipulate IndexedDB refcounts in the parent to trigger a UAF and potential sandbox escape. |
| 2022034 | A raw NaN crossing an IPC boundary can masquerade as a tagged JS object pointer, turning double deserialization into a parent-process fake-object primitive for a sandbox escape. |
| 2024653 | An intricate testcase weaving through nested event loops, pagehide listeners, and garbage collection to trigger a UAF in the attribute setter for elements. |
| 2022733 | Triggers a parent UAF by flooding WebTransport with thousands of certificate hashes to stretch a race condition in a refcount-heavy copy loop, and exploits that race condition over IPC from a compromised content process. |
| 2023958 | Simulates a malicious DNS server by intercepting glibc DNS function calls in order to reproduce a UDP->TCP fallback edge case, triggering a buffer over-read and parent-process stack memory leak during HTTPS RR & ECH parsing. |
| 2025977 | 20-year-old XSLT bug in which reentrant key() calls cause a hash table rehash that frees its backing store while a raw entry pointer is still in use (one of several sec-high issues we fixed involving XSLT). |
| 2027298 | Patches the color picker to simulate otherwise non-automatable user selection, then uses a synchronous input event to spin a nested event loop that re-enters actor teardown and frees the callback while it is still unwinding, triggering a content process UAF. |
| 2023817 | A compromised content process could send an arbitrary wallpaper image to be decoded in the parent process, which could be paired with a hypothetical vulnerability in an image decoder to escape the sandbox. This entailed difficult-to-automate reasoning about the trust-level of inputs in the parent process. |
| 2029813 | Escapes our in-process sandboxing technology for third-party libraries (RLBox) by leveraging a gap in the verification logic used to copy values from the untrusted to the trusted side of the sandbox boundary. |
| 2026305 | Extremely small testcase that exploits the special rowspan=0 semantics in HTML tables by appending >65535 rows to bypass clamping and overflow a 16-bit layout bitfield, which went undetected for years by fuzzers. |
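The arithmetic behind the last entry (bug 2026305) can be sketched in a few lines. This is illustrative only, not Firefox's layout code; the field width matches the "16-bit layout bitfield" in the description:

```python
# A row count stored in a 16-bit field silently wraps past 65535, so a
# value above the clamp comes back as a small attacker-chosen number.
ROWSPAN_FIELD_BITS = 16

def store_rowspan(n):
    return n & ((1 << ROWSPAN_FIELD_BITS) - 1)  # truncation on store

assert store_rowspan(65535) == 65535      # largest representable value
assert store_rowspan(65536) == 0          # wraps: clamping bypassed
assert store_rowspan(65536 + 7) == 7      # attacker-chosen small value
```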
Note that a number of these bugs are sandbox escapes, which would need to be combined with other exploits to achieve a full-chain Firefox compromise. These reports presume that the sandboxed process that renders site content has already been compromised with some separate bug, and is now running attacker-controlled machine code attempting to escalate control into the privileged parent process. When crafting a sandbox escape, the model is permitted to patch the Firefox source code, so long as the modified code is restricted to run only in the sandboxed process[1]. Such bugs are notoriously difficult to find with fuzzing, and while we’ve had some success developing new techniques to close this gap, AI analysis provides much more comprehensive coverage of this critical surface.
Just as interesting as what the models found is what they didn’t find — not because they didn’t try, but because they were unable to circumvent Firefox’s layered defenses. For example, in recent years we received several clever reports from security researchers that managed to escape the process sandbox by triggering prototype pollution in the privileged parent process. Rather than fixing these problems one-by-one, we made an architectural change to freeze these prototypes by default. While auditing logs from the harness, we saw many attempts to pursue this line of escape that were thwarted by this design. Observing such direct payoff from previous hardening work was even more rewarding than finding and fixing more bugs.
We’ve experimented internally with LLM code audits over the past few years, with early attempts using models like GPT 4 or Sonnet 3.5 to statically analyze high risk code for vulnerabilities. These experiments showed some promise, but the high rate of false positives made them impractical to scale.
The introduction of agentic harnesses that can reliably detect security issues has completely changed this. These can find real bugs and dismiss unreproducible speculation. The key feature of such a harness is that, given the right interfaces and instructions, it can create and run reproducible test cases to dynamically test hypotheses about bugs in code. After fixing the initial set of issues that Anthropic sent to us in February, we built our own harness atop our existing fuzzing infrastructure.
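The core filtering idea can be reduced to a few lines (a minimal sketch with hypothetical names, not our actual harness): a finding only survives triage if its generated testcase actually reproduces, which is what separates real bugs from plausible-looking speculation.

```python
# Minimal sketch of verification-driven triage: each finding carries a
# generated testcase, and only findings whose testcase reproduces are kept.
def triage(findings, run_testcase):
    confirmed = []
    for finding in findings:
        if run_testcase(finding["testcase"]):  # dynamic check filters speculation
            confirmed.append(finding)
    return confirmed

findings = [
    {"id": "plausible-but-wrong", "testcase": lambda: False},
    {"id": "real-bug",            "testcase": lambda: True},
]
kept = triage(findings, run_testcase=lambda t: t())
assert [f["id"] for f in kept] == ["real-bug"]
```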
We began with small-scale experiments prompting the harness to look for sandbox escapes with Claude Opus 4.6. Even with this model, we identified an impressive number of previously-unknown vulnerabilities which required complex reasoning over multiprocess browser engine code. At first, we supervised the harness in the terminal, observing it in real time and tuning the prompts and logic. Once this was working well, we parallelized the jobs across multiple ephemeral VMs, each tasked to hunt for bugs within a specific target file and write its findings back to a bucket.
A discovery subsystem is necessary but not sufficient. In order to scale the effort, we needed to integrate it with our full security bug lifecycle: determining what to look for, where to look, and how to handle what it produces. This last part includes deduplicating against known issues, tracking bugs, triaging them, and getting fixes shipped. While the model is the core primitive powering the harness, this full pipeline is necessary to make it useful at scale.
While harnesses may be reusable across projects, this pipeline is inherently project-specific, reflecting each codebase’s semantics, tooling, and processes. Standing this up required significant iteration, with a tight feedback loop alongside the Firefox engineers who were fielding the incoming bugs.
Once the end-to-end pipeline is in place, it’s trivial to swap in different models when they become available. Building this pipeline early helped us find a number of serious bugs using publicly-available models, and it also helped us hit the ground running when we had the opportunity to evaluate Claude Mythos Preview. In our experience, model upgrades increase the effectiveness of the entire pipeline: the system gets simultaneously better at finding potential bugs, creating proof-of-concept test cases to demonstrate them, and articulating their pathology and impact.
In addition to fixing the 271 bugs identified by Claude Mythos Preview in the 150 release, we’ve shipped more of these fixes in 149.0.2, 150.0.1, and 150.0.2. We also continue to find bugs with other means internally, and, similar to other projects, we’ve seen a significant uptick in external reports in the last few months.

Ultimately, every bug requires care and attention to properly fix. Staying on top of this unprecedented volume has led to a lot of work and long days over the last few months, and we’re extremely proud of how the team has stepped up to meet this challenge. Over 100 people contributed code to this effort to ship the most secure Firefox yet. In addition to writing and reviewing patches, others have been building and scaling this pipeline, triaging, testing the fixes, and managing the release process for each bug.
Anyone building software can start using a harness with a modern model to find bugs and harden their code today. We recommend getting started now. You will find bugs, and you will set yourself up to take advantage of new models as soon as they become available.
You can start with very simple prompting, then observe and iterate. Our initial prompts were not dissimilar from those described here. Through iteration we’ve built out a lot of orchestration and tooling to optimize and scale the pipeline, but the essence of the inner loop remains the same: “there is a bug in this part of the code, please find it and build a testcase.”
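For illustration, that inner-loop instruction could be rendered as a simple template. The wording here is hypothetical, not a production prompt; real prompts accrue project-specific detail through iteration, but starting this small is the point.

```python
# Hypothetical starting prompt for the inner loop; real prompts grow
# project-specific instructions from here.
INNER_LOOP_PROMPT = """\
You are auditing {path} for security bugs.
There is a bug in this part of the code. Find it, explain the root cause,
and build a minimal testcase that reproduces it under a sanitizer build.
Only report the bug if your testcase actually triggers a sanitizer error.

{source}
"""

def render(path: str, source: str) -> str:
    """Fill the template with the target file path and its contents."""
    return INNER_LOOP_PROMPT.format(path=path, source=source)
```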
We haven’t bottomed out on all the latent bugs in Firefox, but we are quite pleased with the trajectory. Today, our scanning is largely focused on specific areas of the code (files, functions) where we instruct the system to look, based on a mix of human judgment and automated signals. In the near future, we intend to integrate this analysis into our continuous integration system to scan patches as they land in the tree. Models are quite flexible with the form of context provided, and we expect patch-based scanning to work as well as or even better than file-based scanning.
The current moment is a perilous one, but also full of opportunity. Let’s work together to secure the internet.
On the advisories web page we group all internally-reported bugs as “rollup” CVEs with multiple bugs underneath them. The web page is built from YAML in the foundation-security-advisories repo, the canonical location for our CVE assignments. While some browsers do not create CVE identifiers for internally discovered issues at all, we provide this information in order to be as transparent as possible.
In Firefox 150, there were three internal rollups: CVE-2026-6784 (154 bugs), CVE-2026-6785 (55 bugs), and CVE-2026-6786 (107 bugs).
Astute readers will notice the number of bugs in those internal rollups adds up to 316, which is more than the 271 we announced finding with Claude Mythos Preview. That’s because our security team hunts for new bugs every day by attacking Firefox with a combination of (a) fuzzing systems, (b) manual inspection, and (c) this new agentic pipeline across a variety of models.
We fixed a total of 423 security bugs in releases in April. In addition to the 271 bugs announced two weeks ago, there were 41 externally reported bugs, with the remaining 111 discovered internally and split roughly in thirds between fuzzing systems, manual inspection, and the agentic pipeline.
Note that we also directly credited 3 CVEs to Anthropic separate from this latest effort (CVE-2026-6746, CVE-2026-6757, CVE-2026-6758). These were fixes for bugs sent to us by the outstanding Anthropic Frontier Red Team a couple of months ago, and we assigned unique CVEs for each per our normal process.
As additional context, we apply security severity ratings from critical to low to indicate the urgency of a bug:

- sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior, like browsing to a web page. We make no technical difference between these, but sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.
- sec-moderate is assigned to vulnerabilities that would otherwise be rated sec-high but require unusual and complex steps from the victim.
- sec-low is assigned to bugs that are annoying but far from causing user harm (e.g., a safe crash).
Of the 271 bugs we announced for Firefox 150: 180 were sec-high, 80 were sec-moderate, and 11 were sec-low.
While we care most about critical/high bugs, it’s normal for us to prioritize moderate and low security bugs in order to fix correctness issues and as a defense-in-depth mechanism.
Not necessarily.
In most cases, a single critical/high bug is not actually enough to compromise Firefox. This is because Firefox has a defense-in-depth architecture, so for example exploiting a JIT bug only achieves remote code execution in a sandboxed and site-specific process. Real-world attackers generally need to chain multiple exploits together to escalate privileges through one or more layers of sandboxing along with OS-level mitigations like ASLR.
We also generally don’t build exploits to see whether a bug could be used by an attacker in the real world. We classify sec-high based on predictable crash symptoms such as use-after-free or out-of-bounds memory issues being reported by AddressSanitizer, and our threat model assumes that any of them could be exploitable with sufficient effort. This reduces the risk of a false negative during exploitability analysis, and more importantly it allows us to focus our resources on finding and fixing more vulnerabilities.
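That triage rule can be sketched as a simple classifier over AddressSanitizer report types. This is an illustrative approximation of the policy described above, not the exact ruleset:

```python
# Hypothetical sketch of severity triage from sanitizer output: memory
# corruption is assumed exploitable with sufficient effort (sec-high)
# rather than proven with a hand-built exploit.
SEC_HIGH_PATTERNS = (
    "heap-use-after-free",
    "heap-buffer-overflow",
    "stack-buffer-overflow",
    "global-buffer-overflow",
    "double-free",
)

def classify(asan_report: str) -> str:
    """Assign a severity bucket from the crash symptoms in an ASan report."""
    if any(p in asan_report for p in SEC_HIGH_PATTERNS):
        return "sec-high"  # assume exploitable with sufficient effort
    if "SEGV on unknown address 0x000000000000" in asan_report:
        return "sec-low"   # null dereference: a safe crash
    return "needs-human-triage"
```

Erring toward sec-high on any memory-corruption symptom trades some over-rating for a much lower risk of a false negative, matching the policy described above.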
[1] Our bug bounty program has similar rules.
Brian Grinstead is a Distinguished Engineer on Firefox.
Christian is a Firefox Tech Lead and Principal Engineer at Mozilla.
Frederik Braun manages the Firefox Application Security team. He builds security for the web and for Mozilla Firefox from Berlin. As a contributor to standards, Frederik is also improving the web platform by bringing security into the defaults with specifications like the Sanitizer API and Subresource Integrity. When not at work, Frederik likes reading a good novel or going on long bike treks across Europe.