https://sebastianraschka.com/llm-architecture-gallery/?compa...
If you look at it, the diagrams are very similar, but the main differences are that the feedforward is replaced with a MoE (router to multiple feedforwards) and the model has a different attention implementation.
I think the point stands: MoE, a myriad of complex attention approaches, shared layers, you name it. And making it all work together well is a huge trial-and-error pain even for small models, never mind getting to efficient hardware utilization.
The page links to the same site you do. No wonder it is similar -- the source is the same!
The very first sentence
> Back in 2022 and 2023 there were two big branches of machine learning happening at Meta.
is unmistakably human. That's not how an LLM would phrase this sentence, and if it did, it would have put a comma after 2023.
I am a professional writer and have been for over 30 years. (I do not use any form of LLM ever.) This means I read a lot. This also means that I have 30+ years of experience of readers not understanding what I wrote, or not getting further than the title, or not getting the main message, or inverting it in their heads, or inserting their own message and then complaining when I diverge, and an endless list of Ways People Do Not Get It.
I am also a trained TESOL teacher. Ability to capture gist is a skill we test for and measure, and many, maybe the majority, of native speakers don't have it and don't know.
In recent years I constantly see people going "this is written by AI" and I have yet to see a single one of them able to coherently prove their point. It's all just feelings and hunches.
So I am calling you on this:
How do you know? Show your working. Demonstrate your case.
Edit: You know how you can recognise someone just from their gait while they walk towards you? I would struggle to describe that for an individual person but it doesn't mean I can't identify them from that alone.
But AI-written pieces do have a certain feeling. A sort of staccato in the succession of ideas that does not feel natural. They emphasize certain points, and you as a reader just wonder why that is. There is the “This thing, not just that thing”. There are also the three successive propositions (mostly in one sentence) to accentuate an idea and “Negation. Strong positive idea in the same direction”.
In general try reading one (vocally) to yourself and it will feel really weird.
Some days, I spend over 4 hours a day reading walls of text written by Claude. If I couldn't recognize Claude's default "voice" by now, something would be wrong. It would be like a Hemingway fan not being able to recognize Hemingway. Except more so, because Claude's writing style is getting worse from version to version, descending into self-parody.
On the statistical side, Pangram's model identifies AI-authored text with a 1-in-5,000 false positive rate, measured against hold-out texts from before 2022. My "ear" also agrees closely with Pangram. If I think something sounds AI written, Pangram virtually always comes back with "AI, confidence: high."
> On top of that, LLM writing is often bad in a very particular way: it's weak on actual things to say, but with an overheated style.
This point is interesting because it raises the question of what "LLM writing" actually is. If it is expanding a smaller prompt into a larger article then yes, by construction the information density is low. But it can also be used to take a semi-coherent stream of consciousness and turn it into something readable, and the people using it that way might already have started to slip under the radar.
This is a lot like how the criminals seem especially stupid because the ones who get caught are disproportionately the stupid ones. The easily detectable LLM writers are going to be the lazy ones.
Back in 2022 and 2023 there were two big branches of machine learning happening at Meta¹. The LLM work that led to Llama was a clean, smooth stack of repeated Transformer modules; the recommendation systems graphs were, by contrast, terrifying. Luckily, the industry has remedied that state of affairs by making LLMs a lot more complicated.
Seb Raschka maintains an excellent gallery of model architectures. You can use it to diff two of the best open models of their respective eras, Llama 3 and Nemotron 3 Ultra.

Attention might be all you need, but modern models certainly use a lot of different variants of it: query grouping, compressed, sparse, linear, sliding-window and more. Mixture-of-Experts added selective routing to feed-forward layers, and we have since started routing just about everything else too, from attention blocks to the residual stream. Vision and audio encoders have gone from bolted on to mixed-in, and models have scaled to run at inference time across multiple GPUs, which inserts communication ops that add extra boundaries in the middle of your model.
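To make "selective routing to feed-forward layers" concrete, here is a minimal pure-Python sketch of top-k Mixture-of-Experts routing for a single token. Everything in it is an assumption for illustration: the names `moe_layer`, `router_weights` and the toy scaling experts are hypothetical, and real implementations batch tokens, add load balancing, and run on accelerators.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    # Router: one logit per expert (a plain dot product in this sketch).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the gate over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen

# Hypothetical toy experts: each one just scales its input.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_layer([2.0, 1.0], router_weights, experts, top_k=2)
```

Even in this toy form the data-dependent branch is visible: which experts run depends on the token, which is exactly what makes the layer harder to fuse than a plain feed-forward block.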
This is not too different from what happened with recsys. The basic architecture of recommendation systems, for the best part of a decade, was a relatively straightforward two-tower sparse neural net. The complexity came from the tension between the need to continually increase capabilities and the need to stay efficient, particularly for inference.
It’s tempting to assume that agents will Fix This: that you’ll hand your PyTorch or JAX definition to Claude Telenovela or whatever and have it generate optimally fused kernels². To make that work you need a fixed, usable baseline to make sure that what is generated is… right.
What happened with recsys was that the gap between performance being an optimization and performance being a necessity became very, very small. Conceptually you can keep a pure model definition that gives you a baseline; in practice, training and testing a model takes significant resources and performance improvements become load-bearing.
If you want to swap attention variant A for variant B, you can afford for B to be ten percent slower. You probably can’t afford for it to be an order of magnitude worse. If A is fused and optimized, you need at least a partially fused and optimized version of B before you can even tell whether it’s worth exploring. The research iteration loop demands a different kind of flexibility than just “optimize this known quantity”. You can’t hand-fuse your way back without investing significant time that might not be worth it, and you can’t generate your way forward without a baseline to check. The only way out is to design for composability up front.
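As a rough illustration of what "a baseline to check" can mean, the sketch below compares a slow, obviously correct masked attention against a candidate that only visits the keys inside a sliding window. It is plain Python on tiny inputs, not a kernel; the function names, the window size and the test data are all assumptions made up for this example.

```python
import math

def reference_attention(q, k, v, mask=None):
    """Unfused baseline: softmax(q . k / sqrt(d)) applied to v, row by row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        if mask is not None:
            # Masked-out positions get -inf so they receive zero weight.
            scores = [s if mask(i, j) else -math.inf for j, s in enumerate(scores)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def sliding_window_attention(q, k, v, window):
    """Candidate variant: only touches the last `window` keys per query."""
    out = []
    for i in range(len(q)):
        lo = max(0, i - window + 1)
        # Run the same math on the slice instead of masking every key.
        out.extend(reference_attention([q[i]], k[lo:i + 1], v[lo:i + 1]))
    return out

def max_abs_diff(a, b):
    """Largest elementwise gap between two equally shaped nested lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Small deterministic inputs: 6 positions, head dim 4, value dim 3.
q = [[math.sin(i + j) for j in range(4)] for i in range(6)]
k = [[math.cos(i * j) for j in range(4)] for i in range(6)]
v = [[float(i - j) for j in range(3)] for i in range(6)]
window = 3

baseline = reference_attention(q, k, v, mask=lambda i, j: i - window < j <= i)
candidate = sliding_window_attention(q, k, v, window)
error = max_abs_diff(baseline, candidate)
```

The point of the harness is the shape, not the speed: the candidate can be rewritten or generated as aggressively as you like, as long as `error` stays within tolerance against the definition you trust.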
One of my favorite kernel developments of the last few years was FlexAttention in PyTorch, which took a whole class of attention operations and allowed you to generate kernels for them, via Triton templates. It built on a huge body of work in attention kernels, and it was designed to be composable and verifiable up front: you can explore with only a very mild impact to performance.
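The composability idea can be sketched without any kernel at all. The toy below expresses attention patterns as small predicates over `(q_idx, kv_idx)` and combines them. It is only an illustration of the pattern, in pure Python, and is not the FlexAttention API; the helper names `causal`, `sliding_window`, `and_masks` and `materialize` are defined here for the example.

```python
def causal(q_idx, kv_idx):
    """A query may only attend to itself and earlier positions."""
    return kv_idx <= q_idx

def sliding_window(window):
    """Build a predicate that keeps keys fewer than `window` steps behind."""
    def mask(q_idx, kv_idx):
        return q_idx - kv_idx < window
    return mask

def and_masks(*masks):
    """Combine predicates: a position is kept only if every mask keeps it."""
    def combined(q_idx, kv_idx):
        return all(m(q_idx, kv_idx) for m in masks)
    return combined

def materialize(mask, q_len, kv_len):
    """Expand a predicate into a dense boolean grid, for inspection or tests."""
    return [[mask(i, j) for j in range(kv_len)] for i in range(q_len)]

# Causal attention restricted to a window of 2 positions.
local_causal = and_masks(causal, sliding_window(2))
grid = materialize(local_causal, 4, 4)
```

Because each pattern is a small pure function, a new variant is a few lines to write and trivial to check against a dense grid, which is the property that makes template-generated kernels explorable.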
Andrej Karpathy recently joined Anthropic, in part to develop richer auto-research-style loops at the frontier. As he has spent the last few years showing, though, being able to cut architectures to their essence and make them composable is as important as a clever agentic setup in climbing that kind of hill.