As time progresses one now has a yardstick to measure progress against. No more excuses - show me the money, baby.
I'm not sure that's true anymore considering how popular Simon's blog is
They definitely get something barebones up and running, but it's far from a fully fledged application.
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
https://github.com/openclaw/openclaw/pulse?period=daily
279 commits to main from 77 authors in the last 24 hours.
Why is there so much churn and how could you trust it with your data? These are the changes from ONE day!
If these are useful changes, surely it’d be superhuman by now given months of this pace.
What are people using this for?
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding'; rather, it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained on 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
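To make the gap between 'code that compiles' and 'code that does what we want' concrete, here is a minimal Python sketch of two reward signals. Everything in it is an assumption for illustration: the function names, the `solve` entry point and the test cases are hypothetical, and this is not a description of how any lab actually runs its post-training.

```python
def compile_reward(source: str) -> float:
    """Return 1.0 if the candidate source compiles, else 0.0."""
    try:
        compile(source, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        return 0.0
    return 1.0


def behaviour_reward(source: str, cases, func_name: str = "solve") -> float:
    """Return the fraction of (args, expected) cases the candidate gets right."""
    namespace = {}
    try:
        # Run the candidate so its function definition lands in `namespace`.
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one case just counts as a failure
    return passed / len(cases) if cases else 0.0


# Hypothetical candidate: compiles fine, but adds where it should multiply.
CANDIDATE = "def solve(a, b):\n    return a + b\n"
CASES = [((2, 3), 6), ((2, 2), 4), ((0, 5), 0)]

print(compile_reward(CANDIDATE))
print(behaviour_reward(CANDIDATE, CASES))
```

The candidate gets full marks from the compile check and passes only one of the three behaviour cases (the one where adding and multiplying happen to agree), which is the indirection in miniature.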
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
Does that suggest the uplift was only for things that are easily verifiable like code?
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.
As acknowledged in the article.
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
I did write some stuff myself just to learn how the Enigma encryption machine worked, purely as a learning exercise. But professionally, I stopped coding in November.
'Nail Guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery-powered ... at some point, they blended seamlessly into the roofer's process, and dramatically multiplied the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
At any point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
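A rough sketch of that review-and-iterate loop in Python, with a round limit so it cannot spin forever. The `refine` helper and the two toy "agents" are hypothetical stand-ins; a real setup would call actual models and a real test suite instead.

```python
from typing import Callable, Optional


def refine(
    generate: Callable[[Optional[str]], str],
    review: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Regenerate until the reviewer has no complaints or rounds run out."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        draft = generate(feedback)
        feedback = review(draft)  # None means the reviewer accepted the draft
        if feedback is None:
            return draft, round_number
    raise RuntimeError(f"no accepted draft after {max_rounds} rounds")


# Toy stand-ins for the writing agent and the reviewing agent.
def toy_writer(feedback: Optional[str]) -> str:
    # The first attempt is wrong; any feedback triggers the corrected version.
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_reviewer(draft: str) -> Optional[str]:
    namespace = {}
    exec(draft, namespace)  # run the draft so `add` is defined
    if namespace["add"](2, 3) == 5:
        return None
    return "add(2, 3) should be 5"


result, rounds = refine(toy_writer, toy_reviewer)
print(rounds, result)
```

Here the reviewer rejects the first draft and accepts the second, so the loop stops after two rounds. The reviewer is only as good as its checks, which is where the e2e tests come in.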
IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have a 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted Gemma.
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed, to figure out those “oh, I think it’s running slow or failing because of xyz” moments, and to prompt further based on what I think it should do instead.
And I know where to make slight changes without burning my allotments.
Other domains I am not sure about, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.
I've certainly had things that Opus fixed using some kind of workaround that GPT-5.5 actually solved.
And the difference between the Sonnet/Gemini/DeepSeek tier and the Opus/GPT-5.5 tier is immediately obvious.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate then we've really lost the idea of what models are supposed to be achieving.
Proof by existence?
https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...
Looks pretty good to me. ChatGPT in "Thinking" mode.
Edit: I've added the Opus version on the same link.
It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.
If you were to just watch them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying levels of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that the harness, the tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.
I divide the work to fit within that 100k and use subagents for the tasks.
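Roughly, the splitting looks like the Python sketch below. The 100k budget is the figure above; the `Task` type, the `plan_subagents` helper and the four-characters-per-token estimate are illustrative assumptions, not a real tokenizer or any harness's API.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 100_000  # the "smart zone" figure from above
CHARS_PER_TOKEN = 4     # assumption: crude heuristic, not a real tokenizer


@dataclass
class Task:
    name: str
    context: str  # files, notes and instructions the subagent would need


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def plan_subagents(tasks, budget: int = TOKEN_BUDGET):
    """Greedily pack tasks, in order, into batches that fit the budget."""
    batches, current, used = [], [], 0
    for task in tasks:
        cost = estimate_tokens(task.context)
        if cost > budget:
            raise ValueError(f"{task.name} alone exceeds the budget; split it further")
        if current and used + cost > budget:
            batches.append(current)  # close the batch and start a fresh one
            current, used = [], 0
        current.append(task)
        used += cost
    if current:
        batches.append(current)
    return batches


tasks = [
    Task("refactor-auth", "x" * 240_000),  # ~60k tokens
    Task("fix-billing", "x" * 200_000),    # ~50k tokens
    Task("update-docs", "x" * 120_000),    # ~30k tokens
]
plan = [[task.name for task in batch] for batch in plan_subagents(tasks)]
print(plan)
```

With those sizes the first task gets a subagent to itself, and the other two share one because together they come to roughly 80k tokens.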
https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and getting stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code
It will almost never converge on the general solution that will pass tests you haven't given it yet.
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
It would support your point about the performance of 20GB local models.
19th May 2026
I put together these annotated slides from my five minute lightning talk at PyCon US 2026, using the latest iteration of my annotated presentation tool.

I presented this lightning talk at PyCon US 2026, attempting to summarize the last six months of developments in LLMs in five minutes.

Six months is a pretty convenient time period to cover, because it captures what I’ve been calling the November 2025 inflection point. November was a critical month in LLMs, especially for coding.

For one thing, the supposedly “best” model (depending mostly on vibes) changed hands five times between the three big providers.

As always, I’m using my Generate an SVG of a pelican riding a bicycle test to help illustrate the differences between the models.
Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.

At the start of November the widely acknowledged “best” model was Claude Sonnet 4.5, released on 29th September. It drew me this pelican.
In November it was overtaken by GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, and then Anthropic took the crown back again with Claude Opus 4.5.
I think Gemini 3 drew the best pelican out of this lot, but pelicans aren’t everything. Most practitioners will agree that Opus 4.5 held the crown for the next couple of months.

It took a little while for this to become clear, but the real news from November was that the coding agents got good.
OpenAI and Anthropic had spent most of 2025 running Reinforcement Learning from Verifiable Rewards to increase the quality of code written by their models, especially when paired up with their Codex and Claude Code agent harnesses.
In November the results of this work became apparent. Coding agents went from often-work to mostly-work, crossing a quality barrier where you could use them as a daily-driver to get real work done, without needing to spend most of your time fixing their stupid mistakes.

Also in November, this happened—the first commit to an obscure (back then) repo called “Warelay” by some guy called Pete.

Over the holiday period, from December to January, a whole lot of us took advantage of the break to have a poke at these new models and coding agents and see what they could do.
They could do a lot! Some of us got a little bit over-excited. I had my own short-lived bout of a form of LLM psychosis as I started spinning up wildly ambitious projects to see how far I could push them.
[Screenshot: the micro-javascript playground running a short script and showing its output. About text: micro-javascript is a pure Python JavaScript interpreter with configurable memory and time limits. This playground runs entirely in your browser using Pyodide (Python compiled to WebAssembly).]

That playground demo shows JavaScript code run using my micro-javascript library, in Python, running inside Pyodide, running in WebAssembly, running in JavaScript, running in a browser!
It’s pretty cool! But did anyone out there need a buggy, slow, insecure half-baked implementation of JavaScript in Python?
They did not. I have quite a few other projects from that holiday period that I have since quietly retired!

On to February. Remember that Warelay project that had its first commit at the end of November?

In December and January it had gone through quite a few name changes... and by February it was taking the world by storm under its final name, OpenClaw.
The amount of attention it got is pretty astonishing for a project that was less than three months old.

OpenClaw is a “personal AI assistant”, and we actually got a generic term for these, based on NanoClaw and ZeroClaw and suchlike... they’re called Claws.

Mac Minis started to sell out around Silicon Valley, because people were buying them to run their Claws.
Drew Breunig joked to me that this is because they’re the new digital pets, and a Mac Mini is the perfect aquarium for your Claw.

My favourite metaphor for Claws is Alfred Molina’s Doc Ock in the 2004 movie Spider-Man 2. His claws were powered by AI, and were perfectly safe provided nothing damaged his inhibitor chip... after which they turned evil and took over.

Also in February: Gemini 3.1 Pro came out, and drew me a really good pelican riding a bicycle. Look at this! It’s even got a fish in its basket.

And then Google’s Jeff Dean tweeted this video of an animated pelican riding a bicycle, plus a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.
So maybe the AI labs have been paying attention after all!

A lot of stuff happened just in the past month.

Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.

Also last month, Chinese AI lab GLM came out with GLM-5.1—an open weight 1.5TB monster! This is a very effective model... if you can afford the hardware to run it.

GLM-5.1 drew me this very competent pelican on a bicycle.

... though when it tried to animate it the bicycle bounced off into the top and the bicycle got warped.

Charles on Bluesky suggested I try it with a North Virginia Opossum on an E-scooter

And it did this! I’ve tried this on other models and they don’t even come close. “Cruising the commonwealth since dusk” is perfect. It’s animated too.


Here’s that Claude Sonnet 4.5 pelican from September for comparison.

So those were the two main themes of the past six months. The coding agents got really good... and the laptop-available models, while a lot weaker than the frontier, have started wildly outperforming expectations.