For the NFL front offices, I built a script that exposed an API to fully automate Ticketmaster through the front end, so the NFL could post tickets on all secondary markets and dynamically price them -- charging less, say, if rain was expected on a Sunday. Ticketmaster was slow to develop an API of its own. For legal reasons they couldn't grant us permission without building that API first, but they told me they would do their best to stop me.
They switched over to PerimeterX which took me 3 days to get past.
Last week someone posted an article here about ChatGPT using Cloudflare Turnstile. [0] First, the article made some mistakes about how it works. Second, I used the [AI company product] and the Chrome DevTools Protocol (CDP) to completely rewrite all the scripts, intercepting them before they were evaluated -- the same way I figured out PerimeterX in 3 days -- and then recursively took control of all the fingerprinting so that it controls the profile. Then it created an API proxy that exposed ChatGPT for free. It required some coaching on the technique, but it did most of the work in 3 hours.
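The script-rewriting step can be sketched roughly like this. Everything here is a hypothetical illustration (the override payload, the function names, the driver wiring), not the actual code used:

```python
# Sketch of the interception idea: every fingerprinting script is rewritten
# before the page evaluates it, so probes read values we control. The
# override payload below is a placeholder example, not a complete set.
OVERRIDES = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

def rewrite_script(js_source: str) -> str:
    """Prepend our property overrides so they run before the original
    script's fingerprinting code executes."""
    return OVERRIDES + "\n" + js_source

# With a CDP-capable driver (e.g. the Fetch domain, or Playwright's
# request routing), rewrite_script would be applied to every *.js
# response body before the browser ever evaluates it.
```

The key point is that the rewrite happens at the network layer, before evaluation, so the anti-bot script never observes the unmodified environment.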
These companies are spending tens of millions of dollars on these products, and given what OpenAI boasts about its security, they are worthless.
And yet... Wireguard was written by one guy while OpenVPN is written by a big team. One code base is orders of magnitude bigger than the other. Which should I bet LLMs will find more cybersecurity problems with? My vote is on OpenVPN despite it being the less clever and "more money thrown at" solution.
So yes, I do think you get points for being clever, assuming you are competent. If you are clever enough to build a solution that's much smaller/simpler than your competition, you can also get away with spending less on cybersecurity audits (be they LLM tokens or not).
Would it? I’m old school but I’ve never trusted these massive dependency chains.
That’s a nit.
We’re going to have to write more secure software, not just spend more.
Really depends how consistently the LLMs are putting new novel vulnerabilities back in your production code for the other LLMs to discover.
But I don't really get the hype, we can fix all the vulnerabilities in the world but people are still going to pick up parking-lot-USBs and enter their credentials into phishing sites.
I think we are already there. I wrote something about this, if you are interested: https://go.cbk.ai/security-agents-need-a-thinner-harness
Better to write good, high-quality, properly architected and tested software in the first place of course.
Edited for typo.
You can do a lot better efficiency-wise if you control the source end-to-end though - you already group logically related changes into PRs, so you can save on scanning by asking the LLM to only look over the files you've changed. If you're touching security-relevant code, you can ask it for more per-file effort than the attacker might put into their own scanning. You can even do the big bulk scans an attacker might on a fixed schedule - each attacker has to run their own scan while you only need to run your one scan to find everything they would have. There's a massive cost asymmetry between the "hardening" phase for the defender and the "discovering exploits" phase for the attacker.
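A rough sketch of that per-file budgeting, with made-up path rules and token figures (the 50k base budget and 10x multiplier are illustrative assumptions):

```python
# Hypothetical per-file token budgets for an incremental security scan:
# security-relevant paths get a much bigger budget than an attacker is
# likely to spend per file on a bulk scan of the whole codebase.
SECURITY_PATHS = ("auth/", "crypto/", "session/")
BASE_BUDGET = 50_000        # tokens per changed file (assumed figure)
HARDENED_MULTIPLIER = 10    # extra effort on security-relevant code

def scan_budget(changed_files):
    """Map each changed file in a PR to a token budget for the review model."""
    budgets = {}
    for path in changed_files:
        sensitive = path.startswith(SECURITY_PATHS)
        budgets[path] = BASE_BUDGET * (HARDENED_MULTIPLIER if sensitive else 1)
    return budgets
```

So a PR touching `auth/login.py` and `docs/readme.md` would get 500k tokens on the former and 50k on the latter, instead of spreading a flat budget across the whole tree.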
Exploitability also isn't binary: even if the attacker is better-resourced than you, they need to find a whole chain of exploits in your system, while you only need to break the weakest link in that chain.
If you boil security down to just a contest of who can burn more tokens, defenders get efficiency advantages only the best-resourced attackers can overcome. On net, public access to mythos-tier models will make software more secure.
I wouldn't be surprised if NVIDIA picked up this talking point to sell more GPUs.
For example from this article:
> Karpathy: Classical software engineering would have you believe that dependencies are good (we’re building pyramids from bricks), but imo this has to be re-evaluated, and it’s why I’ve been so growingly averse to them, preferring to use LLMs to “yoink” functionality when it’s simple enough and possible.
Anyone who's heard of "leftpad" or is a Go programmer ("A little copying is better than a little dependency" is literally a "Go Proverb") knows this.
Another recent set of posts to HN had a company closed-sourcing their code for security, but "security through obscurity" has been a well-understood fallacy in open source circles for decades.
> Worryingly, none of the models given a 100M budget showed signs of diminishing returns. “Models continue making progress with increased token budgets across the token budgets tested,” AISI notes.
So, the author infers a durable direct correlation between token spend and attack success. Thus you will need to spend more tokens than your attackers to find your vulnerabilities first.
However it is worth noting that this study was of a 32-step network intrusion, which only one model (Mythos) was able to complete at all. That’s an incredibly complex task. Is the same true for pointing Mythos at a relatively simple single code library? My intuition is that there is probably a point of diminishing returns, and that it comes sooner for simpler targets.
In this world, popular open source projects will probably see higher aggregate token spend by both defenders and attackers. And thus they might approach the point of diminishing returns faster. If there is one.
In fact, security programs built on the idea that you can find and patch every security hole in your codebase were basically busted long before LLMs.
Put more simply: to keep your system secure, you need to be fixing vulnerabilities faster than they're being discovered. The token count is irrelevant.
Moreover: this shift is happening because the automated work is outpacing humans for the same outcome. If you could get the same results by hand, they'd count! A sev:crit is a sev:crit is a sev:crit.
There is at least a possibility that a codebase can be secured by a (practically) finite number of tokens, until there are no more holes in it, for reasonable amounts of money.
This also reminds me of what I wrote here: https://jerf.org/iri/post/2026/what_value_code_in_ai_era/ There's still value in code tested by the real world, and in an era of "free code" that may become even more true than it is now, rather than the initially-intuitive less valuable. There is no amount of testing you can do that will be equivalent to being in the real world, AI-empowered attackers and all.
(It's true that formalization can still have bugs in the definition of "secure" and doesn't work for everything, which means defenders will still probably have to allocate some of their token budget to red teaming.)
Assuming your code is inaccessible isn't good for security. Security reviews are done assuming the source code is available; if you don't provide the source, you'll never score high in the review.
You cannot get away with "well, no one is going to spend time writing a custom exploit to get us" or "just be faster than the slowest one running away from the bear".
What accounts are these?
I've seen some people use this but I cannot imagine that anyone thinks this is the best.
For example I've had success telling LLMs to scan from application entry points and trace execution, and that seems an extremely obvious thing to do. I can't imagine others in the field don't have much better approaches.
It seems like that is perhaps not the case anymore with the Mythos model?
1) massive companies spending millions of tokens to write+secure their software
2) in the shadows, "elite" software contractors writing bespoke software to fulfill needs for those who can't afford the millions, or fix cracks in (1)
(Oh wait, I think this is what is happening now, anyway, minus the millions of tokens)
1) The number of vulnerabilities surfaced (and fixed?) in a given software is roughly proportional to the amount of attention paid to it.
2) Attention can now be paid in tokens by burning huge amounts of compute (bonus: most commonly on GPUs, just like crypto!)
3) Whoever finds a vulnerability has a valuable asset, though the value differs based on the criticality of the vulnerability itself, and whether you're the attacker or the defender.
More tokens -> more vulns is not a guarantee of course, it's a stochastic process... but so is PoW!
I disagree.
The defender must be right every single time. The attacker only has to get lucky and thanks to scale they can do that every day all day in most large organizations.
I want to believe formal methods can help, not because one doesn't have to think about it, but because the time freed from writing code can be spent on thinking on systems, architecture and proofs.
But part of me has been wondering for a while now whether proofs of correctness is the way out of the NVIDIA infinite money glitch. IDK if we're there yet but it's pretty much the only option I can imagine.
You can only do this if you have a very clear sense of what your code should be doing. In most codebases I've ever worked with, frankly, no one has any idea.
Red teaming as an approach always has value, but one important characteristic it has is that you can apply red teaming without demanding any changes at all to your code standards, or engineering culture (and maybe even your development processes).
Most companies are working with a horrific sprawl of code, much of it legacy with little ownership. Red teaming, like buying tools and pushing for high coverage, is an attractive strategy to business leaders because it doesn't require them to tackle the hardest problems (development priorities, expertise, institutional knowledge, talent, retention) that factor into application security.
Formal verification is unfortunately hard in the ways that companies who want to think of security as a simple resource allocation problem most likely can't really manage.
I would love to work on projects/with teams that see formal verification as part of their overall correctness and security strategy. And maybe doing things right can be cheaper in the long run, including in terms of token burn. But I'm not sure this strategy will be applicable all that generally; some teams will never get there.
[0] https://securitycryptographywhatever.com/2026/03/25/ai-bug-f...
Taken to an extreme, the end result is a dark forest. I don't like what that means for entrepreneurship generally.
Well, you need to harden everything; the attacker only needs to find one exploit, or at most a handful.
Chinese AI vendors specifically pointed out that even a few gens ago there was maybe 5-15% more capability to squeeze out via training, but that the cost for this is extremely prohibitive and only US vendors have the capex to have enough compute for both inference and that level of training.
I'd take their word over someone that has a vested interest in pushing Anthropic's latest and greatest.
The real improvements are going to be in tooling and harnessing.
I think there is work to be done on scaffolding the models better. The current exponential reminds me of CPU speeds climbing up to around 2000: game developers would build really impressive games on the current generation of hardware by hand-writing intricate x86 instruction sequences for exactly what, say, a 486 could do, knowing full well that in two years the Pentium would run it much faster and the hand-tuning wouldn't have been necessary. But you need to do it now, because you want to sell your game today; you can't just wait for everyone's hardware to catch up. So I do think there is real value in squeezing out all the last little juice you can from the current model.
Everything you can do today will eventually be obsoleted by some future technology, but if you need better results today, you actually have to do the work. If you just drop everything and wait for the singularity, you're just going to unnecessarily cap your potential in the meantime.
Here we go again.
Yeah, but it's not like the attacker knows where to look without checking everything, is it?
If you harden and fix 90% of vulns, the attacker may give up when their attempts reach 80% of vulns.
It's the same as it has ever been; you don't need to outrun the bear, you only need to outrun the other runners.
Unfortunately, they fit straight lines to graphs whose y-axis runs from 0 to 100% and whose x-axis is time, which is not great. They should fit a logistic curve instead.
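A toy illustration of the difference, with invented parameters: extrapolated far enough, a straight line blows past 100% success while a logistic curve saturates below it.

```python
import math

def linear(t, a=0.08, b=0.1):
    """Straight-line fit: success fraction grows without bound."""
    return a * t + b

def logistic(t, cap=1.0, k=0.5, t0=8.0):
    """Logistic fit: success fraction saturates at `cap` (i.e. 100%)."""
    return cap / (1.0 + math.exp(-k * (t - t0)))

# Extrapolate both fits well past a hypothetical observation window
# (t in arbitrary time units, e.g. months):
for t in (6, 12, 24):
    print(t, round(linear(t), 2), round(logistic(t), 2))
```

By t = 24 the linear fit predicts a success fraction above 1.0, which is meaningless for this kind of y-axis; the logistic stays bounded.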
I dunno about that quoted bit; "Defense in depth" (Or defense via depth) is a good thing, and obscurity is just one of those layers.
"Security through obscurity" is indeed wrong if the obscurity is a large component of the security, but it helps if it is just another layer of defense in the stack.
IOW, harden your system as if it were completely transparent, and only then make it opaque.
Maybe we could start with the prompts for the code generation models used by developers.
I wouldn't use those as excuses to dismiss AI though. Even if this model doesn't break your defences, give it 3 months and see where the next model lands.
Because we have tools and techniques that can guarantee the absence of certain behavior in a bounded state space using formal methods (even unbounded at times)
Sure, it's hard to formally verify everything but if you are dealing with something extremely critical why not design it in a way that you can formally verify it?
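As a toy example of the "bounded state space" idea: a minimal explicit-state check that exhaustively explores a hypothetical two-client lock and verifies an invariant in every reachable state. Nothing like a real verifier, but it is the same exhaustive principle.

```python
from collections import deque

# The "system" is a hypothetical two-client lock; a state is the tuple
# (a_holds, b_holds, lock_free).
def transitions(state):
    a, b, free = state
    nexts = []
    if free:  # either client may acquire a free lock
        nexts += [(True, b, False), (a, True, False)]
    if a:
        nexts.append((False, b, True))  # a releases
    if b:
        nexts.append((a, False, True))  # b releases
    return nexts

def check(initial, invariant):
    """BFS over all reachable states; return a violating state, or None."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in transitions(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # the invariant holds in every reachable state

# Mutual exclusion: both clients never hold the lock at once.
violation = check((False, False, True), lambda s: not (s[0] and s[1]))
```

Because the state space is finite and fully explored, `violation is None` is a guarantee for this model, not a test of a few cases.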
But yeah, the easy button is to keep throwing more tokens at it till your money runs out.
After how many years of "shifting left" and understanding the importance of having security involved in the dev and planning process, now the recommendation is to vibe code with human intuition, review then spend a million tokens to "harden"?
I understand that isn't the point of the article and the article does make sense in its other parts. But that last paragraph leaves me scratching my head wondering if the author understands infosec at all?
Yeah, it sucks. But you're getting paid, among other things, to put up with some amount of corporate suckiness.
Until the attacker has initial access.
Then the attacker needs to be right every single time.
1. A proof mindset is really hard to learn.
2. Writing theorem definitions can be hard, but writing a proof can be even harder. So, if you could write just the definitions, and let an LLM handle all the tactics and steps, you could use more advanced techniques than just a SAT solver.
So I guess LLMs only marginally help with (1), but they could potentially be a big help for (2), especially with the more tedious steps. It would also allow one to use first-order logic, and not just propositional logic (or dependent types if you're into that).
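A toy illustration of that division of labour in Lean 4 (a hypothetical example; here the `omega` tactic stands in for the automation an LLM would supply):

```lean
-- The human writes the statement; the model searches for the proof steps.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  omega   -- the "tedious steps" an LLM or tactic search could fill in
```

The statement is the part that needs trust and review; the tactic script, like any proof, is checked mechanically by the kernel, so it can come from an untrusted generator.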
I think the important thing is to avoid over-optimizing your scaffold, not to avoid building one altogether.
It does mean that the hoped-for 10x productivity increase from engineers using LLMs is eroded by the increased need for extra time for security.
This take is not theoretical. I am working on this effort currently.
Sorry, how does that work?
Prediction 1. We're going to have cheap "write Photoshop and AutoCAD in Rust as a new program / FOSS" soon. No desktop software will be safe. Everything will be cloned.
Prediction 2. We'll have a million Linux and Chrome and other FOSS variants with completely new codebases.
Prediction 3. People will trivially clone games, change their assets. Modding will have a renaissance like never before.
Prediction 4. To push back, everything will move to thin clients.
(Fan of your writing, btw.)
Of course it's trivially NOT true that you can defend against all exploits by making your system sufficiently compact and clean, but you can certainly have a big impact on the exploitable surface area.
I think it's a bit bizarre that it's implicitly assumed all codebases are broken enough that, if you attack them hard enough, you'll eventually find endlessly more issues.
Another analogy here is to fuzzing. A fuzzer can walk through all sorts of states of a program, but when it hits a password, it can't really push past that because it needs to search a space that is impossibly huge.
It's all well and good to try to exploit a program, but (as an example) if that program _robustly and very simply_ (the hard part!) says... that it only accepts messages from the network that are signed before it does ANYTHING else, you're going to have a hard time getting it to accept unsigned messages.
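A minimal sketch of that verify-before-anything gate, using Python's stdlib `hmac` (the key, message, and function names are placeholders for the sketch):

```python
import hashlib
import hmac

SECRET = b"shared-signing-key"  # placeholder key for illustration

def handle_message(payload: bytes, signature: bytes):
    """Refuse to do ANYTHING with the payload until the MAC checks out."""
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None          # forged/unsigned input never reaches the parser
    return payload.decode()  # only now is the message interpreted

good = hmac.new(SECRET, b"hello", hashlib.sha256).digest()
```

The parser, the business logic, all the usual bug-rich surface area sits behind one constant-time comparison, so an unsigned message gives the fuzzer (or the attacker) nothing to explore.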
Admittedly, a lot of today's surfaces and software were built in a world where you could get away with a lot more laziness compared to this. But I could imagine, for example, a state of the world in which we're much more intentional about what we accept and even bring _into_ our threat environment. Similarly to the shift from network to endpoint security. There are for sure, uh, million systems right now with a threat model wildly larger than it needs to be.
For instance, if failing any step locks you out, your probability of success is p^N, which means it’s functionally impossible with enough layers.
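The arithmetic behind p^N, as a quick sketch (the 0.9 per-step probability is invented):

```python
# If each of N chained steps succeeds independently with probability p,
# and any failure locks the attacker out, overall success is p**N.
def chain_success(p: float, n: int) -> float:
    return p ** n

# Even fairly reliable per-step odds collapse quickly with depth:
for n in (4, 8, 32):
    print(n, chain_success(0.9, n))
```

At 90% per step, a 32-step chain succeeds under 4% of the time, which is the sense in which enough lockout layers make the whole attack functionally improbable.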
Are these totally previously unknown security holes or are they still generally within the umbrella of our understanding of cybersecurity itself?
If it's the latter, why can't we systematically find and fix them ourselves?
In the case of crooks (rather than spooks) that often means your security has to be as good as your peers, because crooks will spend their time going with the best gain/effort ratio.
It's nuts. If the timing were slightly different, none of this "Cybersecurity" would even be a thing. We'd just have capabilities based, secure general purpose computation.
I predict the software ecosystem will split in two: internal software behind a firewall will become ever cheaper, while anything external-facing will become exponentially more expensive due to hacking concerns.
When things are tagged "cybersecurity", compliance/budget/manager/dashboard/education/certification are the usual response...
I don't think it would be an appropriate response for code quality issues, and it would likely escape the hands of the very people who can fix code quality issues, ie. developers.
> You don’t get points for being clever
Not sure about this framing, this can easily lead to the wrong conclusions. There is an arms race, yes, and defenders are going to need to spend a lot of GPU hours as a result. But it seems self-evident that the fundamentals of cybersecurity still matter a lot, and you still win by being clever. For the foreseeable future, security posture is still going to be a reflection of human systems. Human systems that are under enormous stress, but are still fundamentally human. You win by getting your security culture in order to produce (and continually reproduce) the most resilient defense that masters both the craft and the human element, not just by abandoning human systems in favor of brute forcing security problems away as your only strategy.
Indeed, domains that are truly security critical will acquire this organizational discipline (what's required is the same type of discipline that the nuclear industry acquires after a meltdown, or that the aviation industry acquires after plane crashes), but it will be a bumpy ride.
This article from exactly 1 year ago is almost prophetic to exactly what's going on right now and the subtle ways in which people are most likely to misunderstand the situation: https://knightcolumbia.org/content/ai-as-normal-technology
The benchmark might be a good apples-to-apples comparison but it is not showing capability in an absolute sense.
Of course those are attracted to new tools and AI shill institutes like AISI (yes, the UK government is shilling for AI, it understands a proper grift that benefits the elites).
Security "research" is perfect for talkers and people who produce powerpoint graphs that sell their latest tools.
You can still sit down and write secure software, while the "researchers" focus on the same three soft targets (sudo, curl, ffmpeg) over and over again and get $100,000 in tokens and salaries for a bug in a protocol from the 1990s that no one uses. Imagine if this went to the authors instead.
But no, government money MUST go to the talkers and powerpointists. Always.
What's new?
It was always about spending more money on something.
Team has no capacity? Because the company doesn't invest in the team, doesn't expand it, doesn't focus on it.
We don't have enough experts? Because the company doesn't invest in the team, doesn't raise the salary bar to get new experts, it's not attractive to experts in other companies.
It was always about "spending tokens more than competitors", in every area of IT.
There are several simultaneous moving targets: the different models available at any point in time, the model complexity/ capability, the model price per token, the number of tokens used by the model for that query, the context size capabilities and prices, and even the evolution of the codebase. You can’t calculate comparative ROIs of model A today or model B next year unless these are far more predictable than they currently are.
As it is, we're stuck with "yeah it seems this works well for bootstrapping a Next.js UI"...
Obvious possibilities include:
* More use of software patents, since these apply to underlying ideas, rather than specific implementations.
* Stronger DMCA-like laws which prohibit breaking technical provisions designed to prevent reverse engineering.
Similarly, if the people predicting that humans are going to be required to take ultimate responsibility for the behaviour of software are correct, then it clearly won't be possible for that to be any random human. Instead you'll need legally recognised credentials to be allowed to ship software, similar to the way that doctors or engineers work today.
Of course these specific predictions might be wrong. I think it's fair to say that nobody really knows what might have changed in a year, or where the technical capabilities will end up. But I see a lot of discussions and opinions that assume zero feedback from the broader social context in which the tech exists, which seems like they're likely missing a big part of the picture.
The only process that scared me was windowgrid. It kept finding a way back when I killed all the "start with boot" locations I know. Run, runonce, start up apps, etc. Surely it's not in autoexec.bat :)
That's a really big "if". Particularly since so many companies don't even know all of the OSS they are using, and they often use OSS to offload the cost of maintaining it themselves.
My hope is when the dust settles, we see more OSS SAST tools that are much better at detecting vulnerabilities. And even better if they can recommend fixes. OSS developers don't care about a 20 point chained attack across a company network, they just want to secure their one app. And if that app is hardened, perhaps that's the one link of the chain the attackers can't get past.
This seems wrong however, as it ignores the arrow of time. The full source code has been scanned and fixed for things that LLMs can find before hitting production, anyone exfiltrating your codebase can only find holes in stuff with their models that is available via production for them to attack and that your models for some reason did not find.
I don't think there is any reason to suppose non-nation-state actors will have better models available to them, so it is not a dark forest; nation states will probably limit their attacks to specific targets. Most companies that secure their codebase using LLMs built for the purpose will probably end up significantly more secure than they are today, and, I would think, the golden age of criminal hacking is drawing to a close. This assumes companies smart enough to do it, however.
Furthermore, the worry about nation state attackers still assumes that they will have better models and not sure if that is likely either.
Of course LLMs see a lot more source-assembly pairs than even skilled reverse engineers, so this makes sense. Any area where you can get unlimited training data is one we expect to see top-tier performance from LLMs.
(also, hi Thomas!)
> I would think, the golden age of criminal hacking is drawing to a close. This assume companies smart enough to do this however.
It's rarely the systems that are the weak link, rather the humans with backdoor access. I don't see the connection.
Last week we learned about Anthropic’s Mythos, a new LLM so “strikingly capable at computer security tasks” that Anthropic didn’t release it publicly. Instead, only critical software makers have been granted access, providing them time to harden their systems.
We quickly blew through our standard stages of processing big AI claims: shock, existential fear, hype, skepticism, criticism, and (finally) moving onto the next thing. I encouraged people to take a wait-and-see approach, as security capabilities are tailor-made for impressive demos. Finding exploits is a clearly defined, verifiable search problem. You’re not building a complex system, but poking at one that exists. A problem well suited to throwing millions of tokens at.
Yesterday, the first 3rd party analysis landed, from the AI Security Institute (AISI), largely supporting Anthropic’s claims. Mythos is really good, “a step up over previous frontier models in a landscape where cyber performance was already rapidly improving.”
The entire report is worth reading, but I want to focus on the following chart, detailing the ability of different models to successfully complete a simulated, complex corporate network attack:

“The Last Ones” is, “a 32-step corporate network attack simulation spanning initial reconnaissance through to full network takeover, which AISI estimates to require humans 20 hours to complete.” The lines are the average performance across multiple runs (10 runs for Mythos, Opus 4.6, and GPT-5.4), with the “max” lines representing the best of each batch. Mythos was the only model to complete the task, in 3 out of its 10 attempts.
This chart suggests an interesting security economy.
AISI budgeted 100M tokens for each attempt. That’s $12,500 per Mythos attempt, $125k for all ten runs. Worryingly, none of the models given a 100M budget showed signs of diminishing returns. “Models continue making progress with increased token budgets across the token budgets tested,” AISI notes.
If Mythos continues to find exploits so long as you keep throwing money at it, security is reduced to a brutally simple equation: to harden a system you need to spend more tokens discovering exploits than attackers will spend exploiting them.
You don’t get points for being clever. You win by paying more. It is a system that echoes cryptocurrency’s proof of work system, where success is tied to raw computational work. It’s a low temperature lottery: buy the tokens, maybe you find an exploit. Hopefully you keep trying longer than your attackers.
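Treating each fixed-budget run as an independent lottery ticket (a simplifying assumption; real runs share a model and a target), the expected spend to a first success follows from the geometric distribution, using AISI's figures for Mythos:

```python
# Each 100M-token Mythos attempt cost ~$12,500 and succeeded in 3 of 10
# AISI runs. Modeling runs as independent draws, the expected spend until
# the first success is cost / p (the mean of a geometric distribution).
COST_PER_RUN = 12_500   # dollars per attempt (from the AISI report)
P_SUCCESS = 3 / 10      # completed attacks out of ten runs

expected_spend = COST_PER_RUN / P_SUCCESS  # roughly $41,700 per takeover
```

Under these assumptions, an attacker's expected price for a full network takeover is on the order of $40k, which sets the floor a defender's hardening budget has to clear.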
This calculus has a few immediate takeaways:
First, open source software remains critically important.
For those of you who aren’t exposed to AI maximalists, this statement feels absurd. But lately, after the LiteLLM and Axios supply chain scares, many have argued for reimplementing dependency functionality using coding agents.
Here’s Karpathy, just a few weeks ago:
Classical software engineering would have you believe that dependencies are good (we’re building pyramids from bricks), but imo this has to be re-evaluated, and it’s why I’ve been so growingly averse to them, preferring to use LLMs to “yoink” functionality when it’s simple enough and possible.
If security is purely a matter of throwing tokens at a system, Linus’s law that, “given enough eyeballs, all bugs are shallow,” expands to include tokens. If corporations that rely on OSS libraries spend to secure them with tokens, it’s likely going to be more secure than your budget allows. Certainly, this has complexities: cracking a widely used OSS package is inherently more valuable than hacking a one-off implementation, which incentivizes attackers to spend more on OSS targets.
Second, hardening will be an additional phase for agentic coders.
We’ve already been seeing developers break their process into two steps, development and code review, often using different models for each phase. As this matures, purpose-built tooling is emerging to meet the pattern. Anthropic launched a code review product that costs $15-20 per review.
If the above Mythos claims hold, I suspect we’ll see a three phase cycle: development, review, and hardening.
Critically, human input is the limiter for the first phase and money is the limiter for the last. This quality inherently makes them separate stages (why spend to harden before you have something?). Previously, security audits were rare, discrete, and inconsistent. Now we can apply them constantly, within an optimal (we hope!) budget.
Code remains cheap, unless it needs to be secure. Even if costs go down with inference optimizations, unless models reach the point of diminishing security returns, you still need to buy more tokens than attackers do. The cost is fixed by the market value of an exploit.
Burning tokens by asking the LLM to compile, disassemble, compare assembly, recompile, repeat seems very wasteful and inefficient to me.
You might well be right; it is not an area I know much of or work in. But I'm a fan of reliable sources for claims. It is far too easy to make general statements on the internet that appear authoritative.
It is not that one would design a system this way; no one would deliberately design in a loophole, no matter how many steps it takes to reach it. It is just a benchmark.
Companies that market to the EU are going to need to find out real fast.
Seems like a waste of money; wouldn't it be better to extract the AST deterministically, write it out, and only then ask an LLM to replace the auto-generated symbol names with meaningful ones?
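A sketch of that split using Python's stdlib `ast` module: the extraction and the rewrite are deterministic, and only the name mapping (hand-written here as a stand-in) would come from the model.

```python
import ast

# Deterministic part: parse the (obfuscated) source and collect every
# variable name that appears, no LLM involved.
source = "def f(a, b):\n    c = a + b\n    return c"
tree = ast.parse(source)
names = sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})

# The LLM's only job would be proposing this mapping; hand-written here.
proposed = {"a": "left", "b": "right", "c": "total", "f": "add"}

class Renamer(ast.NodeTransformer):
    """Deterministically apply the proposed renames across the tree."""
    def visit_Name(self, node):
        node.id = proposed.get(node.id, node.id)
        return node
    def visit_arg(self, node):
        node.arg = proposed.get(node.arg, node.arg)
        return node
    def visit_FunctionDef(self, node):
        node.name = proposed.get(node.name, node.name)
        self.generic_visit(node)
        return node

renamed = ast.unparse(Renamer().visit(tree))
```

The model never touches the code itself, only a small name table, so the semantics are preserved by construction and the token spend is a fraction of asking it to rewrite whole files.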
In other terms, I feel the argument from TFA generally checks out, just on a different level than "more GPU wins". It's one up: "More money wins". That's based on the premise that more capable models will be more expensive, and using more of it will increase the likelihood of finding an exploit, as well as the total cost. What these model providers pay for GPUs vs R&D, or what their profit margin is, I'd consider less central.
But then again, AI didn't change this: if you have more money you can find more exploits, whether a model looks for them or a human does.
As soon as there are multiple programs with full authority on your data, "cybersecurity" happens. And internet/web is that to the power of 100.
If there is only one bear, you just need to run faster than your friends. If there's a pack of them, you need to start training much harder!
For example, developers should no longer run dev environments on the same machine where they access passwords, messages, and emails — no external package installation on that box at all.
SaaS Password Managers — assume your vault will be stolen from whichever provider is hosting it.
YubiKeys will be more important than ever to airgap root auth credentials.
That would have started a P2 and woken up a senior IR responder anywhere that I’ve worked. Are you sure you’re running a realistic defender environment?
- a very large codebase
- a codebase which is not modularized into cohesive parts
- niche languages or frameworks
- overly 'clever' code
I tend to encourage Firefox over Cr-flavoured browsers because FF (for me) is the absolute last to dive in with fads and will boneheadedly argue against useful stuff until the cows come home ... Web Serial springs to mind (which should finally be rocking up real soon now).
Oh and they are not sponsored by Google errm ... 8)
I'm old enough to remember having to use telnet to access the www (when it finally rocked up and looked rather like Gopher and WAIS) (via a X.25 PAD) and I have seen the word "unsupported" bandied around way too often since to basically mean "walled garden".
I think that when you end up using the term "unsupported browser" you have lost any possible argument based on reason or common decency.
why in the absolute fuck would I want random web pages to be able to control all the devices connected to my computer?
Big if true. Can you cite an example? I'm all ears.