1900 - 2010 https://www.thekurzweillibrary.com/exponential-growth-of-com...
1939 - 2023 https://medium.com/@timventura/kurzweils-law-for-the-ai-age-...
even having something like opus 4.8 locally would completely change the landscape
The AI companies owe us money. As does, e.g., NVIDIA for becoming a cartel.
If AMD is competitive on performance per watt and roughly reliable in terms of software support, that covers what most folks outside the US prioritize above all else, since outside of China and the US electricity tends to be at a relative premium.
Maybe if they make smaller data centers viable at the right price, AMD could be part of the stack outside of the US wherever Nvidia is more limited in supply. Though I have genuinely no idea what sourcing an AMD GPU looks like.
I have never seen a company use AMD outside of Wafer and a couple of others, mostly in the US.
Genuinely intriguing, or maybe not really (this could be common knowledge and I am just stuck in my Nvidia bubble here).
If I'm missing something, please let me know!
It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this, but by the same token (pun intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.
So much compute is underutilized, waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired on the right prompts.
I haven't tested enough models Nvidia has converted to NVFP4 besides GLM 5.2 but it seemed fine to me.
My own luck has been hit or miss with it.
[1] https://www.ycombinator.com/launches/Q9i-wafer-pass-flat-rat...
If you plan to run it straight for 8 years at 100% max usage, that's around 1 GWh.
A gigawatt hour is a lot of energy, but it's not that much compared to the price of the actual machine. In Germany, for example, with its expensive energy, that's about €100k worth, which spread over 8 years is pretty minor compared to the upfront half million.
The real issue with high power consumption is not really the cost of energy but the limited power supply you can get for a datacenter. A more efficient setup is highly desirable because it means you can fit more in the limited power hookup.
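To make the arithmetic in this comment explicit, here is a minimal sketch that works backward from the quoted figures. The implied average draw and the implied electricity price are derived here, not stated in the comment, and treating the half-million purchase price as euros is an assumption.

```python
HOURS_PER_YEAR = 8760      # 365 days; leap days ignored
KWH_PER_GWH = 1_000_000

def implied_avg_power_kw(energy_gwh: float, years: float) -> float:
    """Constant draw (kW) that adds up to energy_gwh when run 24/7 for `years`."""
    return energy_gwh * KWH_PER_GWH / (years * HOURS_PER_YEAR)

def implied_price_per_kwh(total_cost: float, energy_gwh: float) -> float:
    """Electricity price implied by a total bill for a given amount of energy."""
    return total_cost / (energy_gwh * KWH_PER_GWH)

avg_kw = implied_avg_power_kw(1.0, 8)          # from "8 years" and "1 GWh"
price = implied_price_per_kwh(100_000, 1.0)    # from "about 100k for 1 GWh"
energy_share = 100_000 / 500_000               # energy bill vs. assumed purchase price

print(f"implied average draw: {avg_kw:.1f} kW")
print(f"implied price: {price:.2f} per kWh")
print(f"energy cost as a share of purchase price: {energy_share:.0%}")
```

Under these figures the lifetime energy bill is about a fifth of the hardware price, which is consistent with the point that the power hookup, not the energy cost, is the tighter constraint.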
There's a few using them, and even more starting to experiment with them. AMD has long been a source of disappointment around this side of things, so I'm hesitant to feel optimistic we'll finally get some competition. The market really needs viable competition to Nvidia, especially performance/watt.
* What does it mean for "performance per dollar" to get faster? Higher, maybe; rise faster than it has in the past, maybe, but just "faster"? Nope.
* The article cites some equipment as being "2x cheaper". I think they mean "half the cost", but if so they should say it.
I guess you really do have to try it at least for some time to actually know
Meta is using AMD: https://www.amd.com/en/newsroom/press-releases/2026-2-24-amd...
And OpenAI: https://www.amd.com/en/newsroom/press-releases/2025-10-6-amd...
Worth remembering that AMD has basically "owned" (not literally) the hardware side of things in video game consoles for a good many years now, with no end in sight.
Just because you haven't seen it doesn't mean it doesn't exist.
We've serviced over 700 customers on our MI300x.
If it's efficient and the power costs are lower, not just the ongoing costs but also the upfront setup, that makes a lot of different scales of data centers practical, especially for inference, which doesn't need massive superclusters.
You can't just fire up gas turbines everywhere like US data centers are doing. I am not even sure if that's legal in the US...
Note you have to plan for peak usage, and a lot of large-scale data centers are insane infrastructure projects.
Nvidia is both supply and price constrained. Sure, if you are willing to pay over $0.5M you might get some, but if you try to balance price against cost by going slightly lower on the pole, you realize just how much more expensive Nvidia truly is. It feels like AMD has a lot of margin to undercut them if they want to.
The article is highly qualified but the headline is not. If they are not making general statements then they shouldn't open with them.
Since many people haven't seen 10MW cabling for a data center or how a big GPU server is cabled, they naturally imagine connecting servers is akin to plugging an appliance into a wall.
When the electricity provider says "I neither have the capacity, nor the required cables in that area", things get real.
But it's Meta; they can get a GW of AMD up in a year.
I have a huge EPYC based data center like 200-300+km from my house on the outskirts of the city a few dozen miles from an IT industry tech park (a place with lots of IT company offices).
Would love more calculations on that
(Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)
So I could see something like this where the neural chipset has an LLM baked into it that can't be so easily updated, until you get a new device.
I also think the dynamic would be really different if model inference can run at ridiculous speeds. You could make a genetic algorithm loop around it, so it can generate a population of proposals at each step, then have those tested and whittled down iteratively. If inference happens at thousands of tokens per second, then from user perspective it would still be really fast, and even a small model could solve complex problems.
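The generate, test, and whittle-down loop described above can be sketched roughly as follows. This is an illustration only: `propose` stands in for a fast model call that mutates a candidate and `score` stands in for whatever test harness ranks proposals. Both names, and the toy integer problem used to exercise the loop, are hypothetical.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")

def evolve(seeds: List[T],
           propose: Callable[[T], T],
           score: Callable[[T], float],
           population: int = 8,
           keep: int = 2,
           steps: int = 10) -> T:
    """Generate-test-select loop. Requires non-empty seeds and keep <= population."""
    survivors = sorted(seeds, key=score, reverse=True)[:keep]
    for _ in range(steps):
        pool = list(survivors)                 # elitism: survivors stay in the pool
        i = 0
        while len(pool) < population:          # fill the rest with fresh proposals
            pool.append(propose(survivors[i % len(survivors)]))
            i += 1
        pool.sort(key=score, reverse=True)     # test and rank every proposal
        survivors = pool[:keep]                # whittle down
    return survivors[0]

# Toy stand-ins: candidates are integers, the "model" nudges them randomly,
# and the "test" rewards being close to a target value.
rng = random.Random(0)
TARGET = 42

def toy_propose(x: int) -> int:
    return x + rng.randint(-5, 5)

def toy_score(x: int) -> float:
    return -abs(x - TARGET)

best = evolve([0, 100], toy_propose, toy_score)
print(best, toy_score(best))
```

Because survivors are carried over unchanged, the best score never gets worse from one step to the next; each step costs `population - keep` proposal calls, which is where very fast inference would pay off.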
See PUE (Power Usage Effectiveness) for its scientific form.
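For reference, PUE is total facility energy divided by IT equipment energy, so a value of 1.0 would mean no overhead at all. A minimal sketch of how it limits the compute that fits behind a fixed power hookup; the 10 MW hookup and both PUE values below are illustrative numbers, not figures from this thread.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw

def it_power_available_kw(hookup_kw: float, pue_value: float) -> float:
    """IT load that fits behind a fixed grid hookup at a given PUE."""
    return hookup_kw / pue_value

# Hypothetical facility: 10 MW hookup, compared at two PUE levels.
for p in (1.5, 1.25):
    print(f"PUE {p}: {it_power_available_kw(10_000, p):.0f} kW left for servers")
```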
They instead start the build out and plug in stuff they can, then take a loan or ask Nvidia to help fund it. (I am not joking)
I believe the case is that if you can prove to Nvidia you can install and provide more Nvidia capacity, they help out, because more capacity going online today is in the best interest of Nvidia.
Spot prices of Nvidia GPUs going up is not good news for Nvidia, btw. The people renting Nvidia have the least amount of friction in moving off Nvidia, especially since with AI tools you could build and get up to speed with the AMD stack much sooner...
So if Nvidia is truly not an option and your entire company is not a bet on Nvidia, then you will move off, but only as a renter, not as a buyer, unless they truly can't fund Nvidia, I suppose.
But again I repeat if you build a datacenter and provide good enough base Nvidia will help fund you to a mostly complete data center.
People might not like it but that's the reason Nvidia is so unreasonably dominant even now when otherwise given the scale of investments it might have been cheaper to look for alternatives.
This is why Nvidia doesn't like the China stack.
love to debate actual discussion points. pull up "datacenter dfw" on Google Maps for mine.
Have you noticed we like AMD?
The demand for inference is skyrocketing and outpacing supply. With frontier models being released almost every other week — Claude Fable, GLM5.2, and Minimax M3, to name a few — the token craze is only getting crazier, and there aren’t enough Blackwells going around to support it. Thus, NVIDIA GPU prices are climbing fast, and tokens are getting really expensive.
In comes AMD. At around 2.75x cheaper per GPU on average (MI355X vs B300) with comparable hardware specs, the solution to cheap inference is hiding in plain sight, a message we at Wafer have been preaching for months. But although AMD’s Instinct MI350 series competes with Blackwells at the silicon level, NVIDIA’s software advantage and day-0 support typically allow providers to serve inference much faster on their hardware with much less friction.
Conversely, on the MI355X / ROCm stack SOTA performance rarely comes out of the box for these frontier models (sometimes it does!). In fact, you’re lucky if you can find an image that runs them at all. Without this day-0 support, building and optimizing for the newest models can require weeks of engineering and compute. By then, the newest model has already been released, making it so AMD is always playing catch-up.
But as agents improve at kernel and model optimization, this gap is closing in real time. At Wafer, we’ve proven this time and time again.
And again — on a 20k in / 1k out, 60% cache hit rate workload, we hit an aggregate throughput of 2626 tok/s/node @ 2.4 rps with a defined knee of ≤5s TTFT — only 80% of the performance measured on a B200, despite being over 2x cheaper.
| Sustained RPS | Aggregate tok/s/node | TTFT p50 / p95 | Success |
|---|---|---|---|
| 0.5 | 449 | 0.59s / 0.60s | 100% |
| 1.0 | 974 | 0.60s / 0.81s | 100% |
| 1.5 | 1913 | 0.62s / 1.03s | 100% |
| 2.0 | 1944 | 0.62s / 1.05s | 100% |
| 2.25 | 2089 | 0.63s / 1.23s | 100% |
| 2.4 (saturation) | 2626 | 0.81s / 2.22s | 100% |
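One way to read a sweep like this is to pick the highest sustained request rate that still meets the latency budget. The sketch below does that over the rows of the table; applying the ≤5s budget to p95 TTFT (rather than p50) and requiring 100% success are assumptions, since the article does not say which percentile defines the knee.

```python
from typing import List, NamedTuple, Optional

class SweepRow(NamedTuple):
    rps: float          # sustained requests per second
    tok_s: int          # aggregate tok/s/node
    ttft_p50_s: float
    ttft_p95_s: float
    success: float      # fraction of requests that succeeded

# Rows copied from the table above.
SWEEP = [
    SweepRow(0.5, 449, 0.59, 0.60, 1.0),
    SweepRow(1.0, 974, 0.60, 0.81, 1.0),
    SweepRow(1.5, 1913, 0.62, 1.03, 1.0),
    SweepRow(2.0, 1944, 0.62, 1.05, 1.0),
    SweepRow(2.25, 2089, 0.63, 1.23, 1.0),
    SweepRow(2.4, 2626, 0.81, 2.22, 1.0),
]

def saturation_point(rows: List[SweepRow],
                     ttft_budget_s: float = 5.0,
                     min_success: float = 1.0) -> Optional[SweepRow]:
    """Highest-RPS row whose p95 TTFT and success rate stay within budget."""
    ok = [r for r in rows
          if r.ttft_p95_s <= ttft_budget_s and r.success >= min_success]
    return max(ok, key=lambda r: r.rps) if ok else None

knee = saturation_point(SWEEP)
print(knee.rps, knee.tok_s)
```

Every published row is within the 5s budget, so this only confirms that 2.4 RPS is the highest passing point among the rows shown; the rows past saturation are not in the table.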
We also hit 213 tok/s on GLM5.2 on 10k input tokens / 1.5k output tokens single stream, following Artificial Analysis standards, served on AMD MI355X capacity from TensorWave. Though this number doesn’t top the AA leaderboard, it still wins on performance per dollar.
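The first step with any model work is to choose a quantization and framework. We quantized the base bf16 GLM-5.2 to MXFP4 with AMD Quark. Compared with z-ai’s official FP8 quantization, our MXFP4 was effectively lossless on GPQA-Diamond, tau2, and GSM8K: the GSM8K and GPQA-Diamond differences fall within the reported error bars, and tau2 came out slightly higher.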
| Eval | FP8 baseline | MXFP4 | Δ (MXFP4 − FP8) |
|---|---|---|---|
| GSM8K (200q, 5-shot, greedy) | 0.965 ± 0.013 | 0.955 ± 0.014 | −0.010 |
| GPQA-Diamond (198q × 2 seeds, temp 1.0) | 0.9217 ± 0.027 | 0.9026 ± 0.029 | −0.019 |
| tau2 macro | 0.819 | 0.834 | +0.015 |
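As a rough check of the table, the sketch below recomputes each delta and compares it with the combined uncertainty of the two runs. It assumes the ± values are independent standard errors that can be combined in quadrature, which the article does not state; tau2 is left out because no error bar is reported for it.

```python
import math

# (FP8 score, FP8 +/-, MXFP4 score, MXFP4 +/-), copied from the table above.
EVALS = {
    "GSM8K": (0.965, 0.013, 0.955, 0.014),
    "GPQA-Diamond": (0.9217, 0.027, 0.9026, 0.029),
}

def compare(fp8: float, fp8_err: float, mx: float, mx_err: float):
    """Return (delta, combined error, whether |delta| is within the combined error)."""
    delta = mx - fp8
    combined = math.hypot(fp8_err, mx_err)   # sqrt(a^2 + b^2), assumes independence
    return delta, combined, abs(delta) <= combined

results = {name: compare(*row) for name, row in EVALS.items()}
for name, (delta, combined, within) in results.items():
    print(f"{name}: delta={delta:+.3f}, combined error={combined:.3f}, within={within}")
```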
As for the inference framework, we had three options — vLLM, ATOM, and sglang. Among the three, we chose sglang — vLLM had no working MXFP4 + GlmMoeDsa path so the MXFP4 weights provided no benefit, and ATOM’s output degraded at long context. Sglang was the inference engine with the least friction to native support, able to take advantage of the quantization while remaining coherent.
The next natural step to improving throughput was enabling speculative decode on sglang. However, the sglang ROCm image does not support this out of the box. There were two fixes needed before MTP worked properly.
First, the MTP head, like every other layer, keeps its single shared expert stored in bf16, not MXFP4. However, the MTP head is registered under a different module prefix than the main decoder stack (Quark names its bf16 shared expert model.layers.78.mlp.shared_experts.*, while the MTP layer’s real prefix is model.decoder.*). Because of the mismatch, sglang’s quantization lookup fails and defaults to building that shared expert as MXFP4. At load it then tries to read a full-width bf16 weight into a half-width 4-bit slot and the init crashes on a shape mismatch. Quark records which weights to leave un-quantized as a list of layer names, so we copied over the layer 78 entries to that list a second time under the decoder name sglang actually uses. This fix unblocked speculative decode, netting us close to a 3x gain in single stream throughput.
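The list fix described above can be sketched as a small helper that duplicates the layer 78 entries under the decoder prefix. This is not the actual patch: how the exclude list is stored in the Quark config, the `lm_head` entry, and the exact prefix mapping from `model.layers.78.` to `model.decoder.` are assumptions made for illustration.

```python
from typing import List

def alias_excluded_layers(exclude: List[str],
                          src_prefix: str = "model.layers.78.",
                          dst_prefix: str = "model.decoder.") -> List[str]:
    """Return the exclude list with src_prefix entries repeated under dst_prefix."""
    out = list(exclude)
    seen = set(exclude)
    for name in exclude:
        if name.startswith(src_prefix):
            alias = dst_prefix + name[len(src_prefix):]
            if alias not in seen:          # keep the operation idempotent
                out.append(alias)
                seen.add(alias)
    return out

# Hypothetical list of layers left un-quantized (bf16).
exclude = ["lm_head", "model.layers.78.mlp.shared_experts.*"]
patched = alias_excluded_layers(exclude)
print(patched)
```

Keeping the original entries and only appending aliases means a loader that matches on either prefix finds the shared expert in the un-quantized list.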
Second, deep speculative decode (such as the 5/1/6 config z-ai suggests) was still blocked. The fused multi-step metadata kernel needed for draft depth ≥4 writes #include <cuda_runtime.h> with no ROCm guard. Fix: one #ifdef USE_ROCM guard.
Two trivial, but necessary changes to take full advantage of speculative decode. With spec dec working properly, alongside a few config optimizations (such as --kv-cache-dtype fp8_e4m3 and --enable-aiter-allreduce-fusion), we reached our headline single stream decode number at 213 tok/s.
But for aggregate throughput, especially with our defined workload, decode optimizations are necessary but insufficient. At 20k in @ 60% cache, the workload is primarily prefill bound.
At TP8, which was the configuration optimized for single stream decode, the MI355X can run GLM5.2-MXFP4 at 1461 tok/s/node. Switching to TP4×DP2 netted a massive improvement on this workload, getting us to 1944 tok/s/node at 2.0 RPS — still relatively slow compared to our measured Blackwell performance, which hit 3192 tok/s/node at 3.0 RPS. A big reason for the poor prefill performance on the MI355X is that on the sglang image, GLM-5.2’s fp4 MoE was silently on a slow FlyDSL heuristic fallback (aiter only shipped tuned configs for the a8w8/fp8 path). We tuned the MoE kernel selection ourselves on GLM’s fp4 shapes (model_dim 6144, moe_inter 2048, E=256, topk=8), which allowed us to reach 2626 tok/s/node at 2.4 RPS. Much better.
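The throughput figures in this paragraph can be turned into relative gains and a rough performance-per-dollar ratio. The sketch below uses a price ratio of 2.0 as a placeholder lower bound for the article's "over 2x cheaper" claim; the real ratio is not given here, and the two systems were measured at different request rates (2.4 vs 3.0 RPS), so treat the result as indicative only.

```python
def relative_perf_per_dollar(tput_a: float, tput_b: float,
                             price_b_over_a: float) -> float:
    """Perf/$ of system A relative to system B, given how much pricier B is."""
    return (tput_a / tput_b) * price_b_over_a

TP8 = 1461          # tok/s/node, MI355X at TP8
TP4_DP2 = 1944      # tok/s/node, MI355X at TP4xDP2
TUNED = 2626        # tok/s/node, MI355X after MoE kernel tuning
BLACKWELL = 3192    # tok/s/node, measured Blackwell figure

parallelism_gain = TP4_DP2 / TP8
tuning_gain = TUNED / TP4_DP2
throughput_ratio = TUNED / BLACKWELL
perf_per_dollar = relative_perf_per_dollar(TUNED, BLACKWELL, 2.0)  # 2.0 is an assumed bound

print(f"TP8 -> TP4xDP2: {parallelism_gain:.2f}x")
print(f"MoE tuning: {tuning_gain:.2f}x")
print(f"MI355X vs Blackwell throughput: {throughput_ratio:.0%}")
print(f"perf per dollar at an assumed 2.0x price gap: {perf_per_dollar:.2f}x")
```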
Although there was some degree of friction, achieving the best performance per dollar ratio on the MI355X wasn’t particularly hard — though there were some framework related bugs, unlike our work with Qwen3.5 397B, you’ll notice that we didn’t actually write any custom kernels this time. Though this study doesn’t take multi-node performance into consideration, single-node deployments still remain highly prevalent in practice.
SOTA on AMD is becoming more a matter of support, not software. The CUDA moat is eroding in real time.
https://youtu.be/_bP80DEAbuo?is=sg09k66iutKFIFSo
Yet here we are, discussing "data centers" as if they're standardized and of similar (noise) isolation.
There are no meaningful regulations in building them, and they can be incredibly polluting. So your experience with a potentially well isolated one is sadly not the norm going forward. And we don't even know how close you lived; if you're, e.g., talking about "within 5km/3miles", then your experience would also have little value in this discussion in general.
Of course there are techniques such as quantization aware training but I don't understand why a datatype would work for inference but not for that.
You can also abandon backprop entirely but that comes with a whole host of tradeoffs and again why would it work for inference but not for whatever alternative training regime you selected?
Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.
But yeah, grain of salt - we haven't seen this in practice.
If a municipality doesn’t have emissions, noise, water use, etc regulations, that’s a serious failure in governance.
We neither need nor want the word “data center” in regulations any more than we need the word “abattoir.”
The names of the things we build change all the time. Their impact on their communities doesn’t.
We need to regulate impact, not the name or type of business.
If we did, nobody would know or care about data centers and they wouldn’t be affecting their communities, because they’d be operating under established impact regulations.
Can you cite a source for this? It's not in the video, as far as I can tell.
I would be wary of Benn Jordan's videos. They are full of mistakes and misrepresentations, as Andy Masley has convincingly demonstrated: https://blog.andymasley.com/p/contra-benn-jordan-data-center...
I recall seeing Benn Jordan's responses on Bluesky and thinking they were quite poor. He was unwilling to admit to mistakes, and kept trying to grasp at newly searched papers that didn't actually support his arguments.
Indeed, he shot himself in the foot there pretty bad, but I would argue that that was just the result of successful Agitation.
I would personally strongly prefer being in the same room with Benn compared with Andy, because one of them is authentic, while the other is calculating. Though, arguably, Benn has been catching up on that lately too.
But yeah, taking stuff with a grain of salt should be the default regardless of the person speaking.
10 years ago, I was running 4 CPU servers with 48 cores and 128GB of RAM in 2U enclosures with a maximum power consumption of 500W or so. I was able to stick ~20 of them in a 42U rack, totaling 10kW.
A data center full of these can be cooled with CRACs and hot/cold aisles without much problem. This is still too much for a bog-standard server colocation operation, but for HPC, that was normal and manageable.
Now, a ~1U server houses 4 SOTA NVIDIA GPUs, 64 cores, and orders of magnitude more RAM. This server alone uses ~3kW of power. This means you go anywhere between 30kW and 50kW per rack, and you have many racks.
Of course this means more power comes in, more heat comes out. This means more sophisticated infrastructure: bigger and beefier primary and secondary power systems, beefier cooling, more heat, more noise, in short "more of everything".
Of course when you cram this much energy and heat into a relatively small space, its effect on the environment will be much more pronounced.
Facebook's previous SOTA datacenter used water infused, HEPA filtered free flowing air across the datacenter. Now, it's server level direct liquid cooling with extensive water treatment and oversight on coolant parameters.
Compare this to having a hand warmer vs. a coal ember in your hand. The latter needs a much more elaborate setup to prevent it from burning you badly.
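The rack figures in this comment work out as follows. A minimal sketch using only the numbers quoted above (20 servers at about 500W then, about 3kW per GPU server now, 30kW to 50kW per rack); the helper names are made up for illustration.

```python
def rack_power_kw(servers: int, watts_per_server: float) -> float:
    """Total rack draw in kW for identical servers."""
    return servers * watts_per_server / 1000

def servers_per_rack(rack_budget_kw: float, watts_per_server: float) -> int:
    """How many whole servers fit under a rack power budget."""
    return int(rack_budget_kw * 1000 // watts_per_server)

old_rack_kw = rack_power_kw(20, 500)           # the CPU rack from 10 years ago
gpu_servers_low = servers_per_rack(30, 3000)   # GPU servers in a 30 kW rack
gpu_servers_high = servers_per_rack(50, 3000)  # GPU servers in a 50 kW rack

print(f"old rack: {old_rack_kw:.0f} kW")
print(f"GPU servers per rack: {gpu_servers_low} to {gpu_servers_high}")
print(f"density increase: {30 / old_rack_kw:.0f}x to {50 / old_rack_kw:.0f}x")
```

So a rack that once carried 10kW now has to deliver and remove three to five times as much power and heat in the same footprint.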
You can stuff GPU servers into existing buildings- but even with significant upgrades you end up with a lot of empty space on the floor that can't be used.
1. Article is about AI, so I have given the example for an AI datacenter.
2. In pure CPU datacenters, the power dynamics do not change much. I can add more servers to a single rack, but the rack power is again in the 30kW to 50kW range, so you're planning and building for the same power capacity.
> You can stuff GPU servers into existing buildings-
Yes.
> but even with significant upgrades you end up with a lot of empty space on the floor that can't be used.
Yes & No. It's not impossible to convert an old datacenter to support ~35kW/rack capacity, but it's not cheap, and you'll have more worries than holes, piping, building and power. Namely, can your floor handle that much weight to begin with?
If we had regulations on noise, vibration, emissions, water use, electromagnetic radiation, whatever else, then it wouldn’t matter what people tried to build — if it fits within the guidelines great, otherwise back to the drawing board.
Putting “data center” in your ordinances is as lazy and ineffective as putting “abattoir.”
Sane jurisdictions do have regulations regarding these things. Not all jurisdictions are sane, some of them are run by people who sell out their residents.
Suburbs and cities around me all have noise regulations, my state has its own pollution regulations, and the local water utilities don’t hook up customers that stress the system. Unfortunately there are places like Texas, Tennessee, Louisiana, Mississippi that don’t give two shits about their citizens and let companies run temporary natural gas turbines permanently and all kinds of other nonsense.
We certainly do! It’s just often overridden and ignored for these companies and data centers
This does sound plausible, but it's also pretty sad and not a sign of a healthy democracy