I have no doubt that we will have another wave of cheap retired server gpus just like before. And that is the time when everyone will have their own models at their home.
Or we can just buy the newest medusa halo mini pc. they will be pretty decent, too, albeit pricey.
I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.
EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.
Here's my setup. You may want to figure out what the best optimizations are for your specific CPU like AVX2 because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context or go even lower for Q2 and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best I assume but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context but it could hurt quality some say, altough I have noticed it too on some models.
# Building
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON
# Running
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \
llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1
An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for an satisfactory interactive session.
I wonder if I could get similar or even better performance from similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?
The CPUs are better core wise, but that probably does not make much difference?
It has CPUs 2 × Xeon E5-2697 v2
Cores / threads 24 cores / 48 threads total
Per-CPU cores 12 cores / 24 threads
Base clock 2.70 GHz
Max turbo 3.50 GHz
It is sitting gather dust but reading spead Gemma sounds promising.
Which makes sense I suppose.
As it is, the title is click-bait for me, as 1) it says I need at least a Xeon somehow and 2) as it doesn't say what I actually need it for.
Should the market react to the memory shortage, the progress of the Apple silicon continue at the same pace, and what we’ll be able to run locally in 6 years will be very exciting. or frightening.
Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.
But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?
I also dont understand the explanation of "--cpu-moe". If an expert has ~ 4.0 GiB of Parameters, why does optimizing the sequence of experts minimize cash trashing? With 20 MiB of L3 Cash vs 4.0 GiB of Parameters, it wont cash any noticeable amount of the Parameters, will it?
As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Also with 128G. Does 8 dimm sockets imply more actual bandwidth in practice?This poor thing is currently a YouTube watching box.
Guess I am a species-ist after all ;)
It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.
What was the net effect of the optimisations? How much faster did it get?
High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440
In my opinion, the bottleneck is the package management layer and not the model capabilities and performance.
I have been an avid Linux user for decades, and if I find it confusing and painful, something is missing.
You say it runs "at reading speed". Have you benchmarked it?
So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)
Everything else doesn't work
For a long time, too. Programming languages rarely change much, techniques rarely change, so I should be able to use said model for I hope at least five years; and if at any time they optimize local models to cram even more intelligence into the same amount of VRAM, I can upgrade to that.
I like this path.
Paradoxically, the better results we get from general harness of coding agents, the less moat Claude and co. get. It's unbelievably how fast some open models outpaced frontier models of just a few months ago.
Sadly - it's going to be ads. Advertising is going to get in there and enshittify the whole thing because as always, advertising income is too easy and too plentiful for any company to resist.
Right now the models are fairly agnostic, but we are a hair-breadth away from ChatGPT responding with, "the right tool for this job is a circular saw - something like the Milwaulkee M18, which happens to be on sale at Home Depot this weekend."
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.
https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281
It is way too slow
https://pcpartpicker.com/products/motherboard/#s=20028,20029...
(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).
Plus many boards also support CXL for RAM expansion over PCI 5!
Source: building a hybrid inference business for regulated industry workloads.
numactl --membind=1
so it is constrained to one of the memory sticks which speeds up token generation a little.
At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.
The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.
Of course, AI helped me work out a plan for this. Haha
Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code.
I'd love if anyone knows how I might fare with an old Dell R710 with 2 x Xeon 5600 (12 cores total) and 96Gb of DDR3.
I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.
Om am unrelated note, does anyone know a model that can help with this use case:
There's a lot of budget hosting built around chips like these, and they're suprisingly power efficient.
Probably nothing. Most users have no idea what an LLM is or how it runs. Anecdotally speaking, I see many LLM users default to whatever their day job provides to them. And even slightly more sophisticated users seem ok with paying for their openai or anthropic subscriptions.
Maybe we will see a small but dedicated group of open weight model users who prefer local llm, but everybody else will just consume from the big providers? The scenario might look something like OS choices today - a small, committed group of Linux users vs the vast majority of other users running Windows, MacOS, or Chrome?
I'm shoehorning it back in the Optiplex that donated the ram, so it's not ready to go at the moment, but when I had it running on top of the motherboard box as a test I ran the (9B?) gemma4:e4b-it-q4_K_M since it can fit entirely in the 11gb vram. It flew, more than 50tk/s. A model that small isn't useful for coding, but there could be uses. I'd love to figure out a Wake-on-Use and use it as my personal ChatGPT. I'm not sure how that would work... Maybe proxy the LLM thru a Pi with a script to Wake-on-LAN the PC? It'll be a fun weekend project someday.
My always-on LLM is the dense Gemma4:31b that's not quite half in GPU on a 12gb 2060. It's really slow, but the quality is great and my use case is an automated queue so I'm not sitting there watching the output. I have another 2060 but unfortunately the PC won't POST with both installed for some reason.
If you get a not-quite-the-best gaming GPU like a 5080, you can run local models that are better than the state of the art from early 2025. Depending on what you want to do, you might have to switch models. The one size fits all huge models are still a data center thing.
I also had no idea RAM and GPU costs would explode they way they did, just happened to do it the right time. I might try to grab a ~$300 3080 on Ebay and sell the 1080ti, but otherwise it's been a great upgrade -- it sucks electricity like Coca Cola, but otherwise performs fantastic as a workstation, and I'm just gonna drive it til the wheels fall off.
One node's ipmitool sensor report (and self-monitoring PSU, so grain of salt, but my UPS side monitoring tracks closely), reports 250-300w average power use. This though, mind you is for running 22 spinning disks, 2 SAS/SATA SSDs, and 4 NVME ssds, and 768GB of DDR4.
Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.
As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, a RDNA4 GPU, and a small pile of NVME drives has a power floor of roughly 250W. Some CPU centric workloads you'll definitely lose out on on the older gens of machines, but they are by no means impractical.
Maybe for a desktop usecase they are absolutely suboptimal nowadays, but for a lot of realworld usecases I would say they're still relevant.
---
Like the author posts for the LLM usecase, I think optimizing the hardware choice to the application and not leaving levers unpulled is a big key, especially considering how wide a variety of bandwidth/power draw/peak frequency/corecount SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.
https://www.intel.com/content/www/us/en/products/sku/92986/i...
I experiment with all of the local models I can fit into 32GB of VRAM and I have subscriptions to multiple SOTA providers.
The difference between them is very large, unfortunately. The local models can handle small tasks and refactoring mostly okay, but doing anything challenging with them becomes a waste of time. Unfortunately the waste isn’t immediately obvious because they will come back with something that looks like it works, but then on closer examination I need to throw it out and reset them in a usable direction.
Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.
Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).
Enough to validate repurposing an existing workstation with enough RAM, or finding a used high VRAM GPU, or in my case buying a Strix Halo system for home lab and local models.
The future is once again not cloud based, for AI tools.
Published on June 01, 2026
17 minutes read
The previous post covered getting Gemma 4’s MTP drafters quantized and paired with a verifier. This one is about running the result on a machine that has no business running it.
I have a recycled server. To its credit, it has a whopping 128 GB RAM, but it’s DDR3… That RAM is 5-6 times slower than the current best laptop ram. It also has a single Intel Xeon E5-2620 v4 from 2016, which is about 5 times slower than my laptops CPU…
Oh, and as I did mention, we have no GPU. And no, the Xeon does not have an integrated GPU.
But, just hear me out…
If we were to just break out ollama here, well… as explained in earlier blog posts, we can’t. And we’d be lucky if we could in 6 months when they add support for the model we need, if they ever do. Might be they never do. And even still, ollama simply doesn’t expose enough knobs for us to ever make this run well, neither does even the standard llama-cpp.
But. Why would that stop us?
I’ve recieved feedback that some of the previous posts were too high level, I’ll try to make things as clear as reasonably possible here. If you’re a tech worker, or a Linux enthusiast that has built a computer and used something like ChatGPT, most of this should be approachable.
So, just to really set the stage fully. The hardware, per lscpu:
For LLM inference, memory bandwidth is the limiting resource. Every token generated requires hauling gigabytes of weights from RAM into the CPU cache.
When you use a tool like ChatGPT and watch the text stream onto your screen word by word, you are watching the “decoder pass”. During this phase, the model generates the output one piece (or “token”) at a time.
In this step, the system’s raw processing power is rarely the bottleneck. Instead, the limitation is memory bandwidth. To calculate that next word, the processor has to constantly pull massive amounts of data. That data is the “weights” that contain the model’s learned knowledge. It moves this from memory into the compute cores.
The processor executes the required matrix calculations so quickly that it is left sitting idle, waiting for the hardware to physically move the next chunk of weights across the memory bus. In traditional software terms, decoding is heavily memory-bound, not compute-bound.
This is the so called “memory wall”, one of the single biggest performance hurdles now, whether you’re on a Xeon or an H100.
Naively running llama-cli on a DDR3 machine without a GPU is horrendously slow, even if it can run it, because it’s optimized for a generic GPU usecase, and often leaves a lot of improvements on the table. Further, it simply doesn’t have most of the actual optimizations that the state of the art currently uses to run these at scale.
The remedy is to pull every optimization lever ik_llama.cpp exposes. Most of them are slightly obscure.
Here is the magic spell that makes this actually run.
llama-cli \
--model gemma-4-26B-A4B-it-Q8_0.gguf \
--model-draft gemma-4-26B-A4B-it-assistant-GGUF/\
wikitext-2-raw_ik-llama-mtp_drafter-conservative/\
gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
--spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune \
-cnv --color --jinja --special \
-sm graph -smgs -sas -mea 256 --split-mode-f32 \
--temp 0.7 -t 8 --parallel 8 \
--cpu-moe --merge-up-gate-experts \
--flash-attn on --mla-use 3 \
--mlock --run-time-repack --no-kv-offload
Under a blackbox tool like ollama you never see this line. On aging hardware you have to understand what each flag does, because half of them won’t take, and the engine will tell you so in passing.
Speculative decoding.
--spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune
This pairs the 26B verifier with the small drafter from the previous post. Up to three tokens per draft (--draft-max 3), all probabilities accepted (--draft-p-min 0.0), --spec-autotune adjusting the chain length per workload.
This ties directly back to our previous discussion about the memory-bound decoder pass.
When a model uses a long reasoning chain, it is generating those “thinking” tokens one by one. Even if the internal reasoning is hidden from the user and all you see is a short final answer, the hardware still has to perform a full decoder pass for every single token in that hidden chain.
In fact, speculative decoding is currently one of the most brilliant software workarounds the AI industry has invented to bypass the “memory wall,” and spec autotune is how you squeeze the maximum speed out of it.
The argument for speculative decoding is stronger on CPU than on GPU. CPU compute is cheap relative to the cost of streaming the verifier’s weights through cache, so spending extra cycles on a tiny drafter whose active layers easily fit in L3 buys tokens at very little marginal cost. The drafter’s working set fits in L3. The verifier however spills out of everything.
CPU and MoE routing.
--cpu-moe --merge-up-gate-experts -t 8 --parallel 8
Gemma 4 26B-A4B has 128 experts with 8 active per token, giving about 3.8B active parameters out of ~25.2B total. --cpu-moe tunes the routing for CPU cache hierarchies.
CPUs handle memory very differently than GPUs. While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.
In an MoE model, constantly jumping around between 128 different experts can cause “cache thrashing”, where the CPU constantly has to dump its cache and fetch new weights from the much slower main system RAM (normally DDR4/DDR5, we’re on DDR3!).
This flag tells the router to be smarter about how it picks experts, optimizing the sequence so the weights stay neatly inside the CPU’s local cache for as long as possible.
--merge-up-gate-experts fuses two per-expert projections into a single matmul, which the logs confirm:
fused_up_gate = 1
This is a software trick to bypass the memory bandwidth bottleneck we discussed earlier.
Inside the experts, the math operations require data to be passed through different layers. Normally, the processor would calculate an “up projection”, write the result to memory, then load the weights for a “gate projection”, calculate that, and combine them. That requires moving data across the memory bus multiple times.
Instead of doing two separate trips over the memory bus, it combines the operations into a single step.
-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other.
Memory pinning, repacking, KV cache.
--mlock --run-time-repack --no-kv-offload
--run-time-repack reorganizes weight matrices in memory immediately before inference to match the CPU’s cache layout. The logs confirm:
============ Repacked 265 tensors
Processors have their own ultra-fast, built-in memory called caches (L1, L2, and L3). However, these caches expect data to be fed to them in very specific shapes and sizes.
If the AI’s weight matrices are sitting in system RAM in a generic layout, the CPU has to awkwardly pull the data in pieces, resulting in “cache misses” where the CPU stalls. --run-time-repack tells the engine to spend a few seconds during startup to physically reorganize the massive tables of numbers in the RAM so they perfectly align with how the CPU wants to ingest them. It pays a small time penalty upfront to guarantee maximum memory bandwidth during the actual text generation.
--mlock is meant to pin the model in RAM so the OS cannot swap any of it to disk.
mlock stands for “memory lock”, suprising, I know! In standard operating systems, if the system starts running out of RAM, it will quietly take data that hasn’t been used in a few seconds and “swap” (or page) it to the physical hard drive.
If an OS tries to swap out 27GB of AI weights to a disk, the generation speed will instantly drop to zero while the system chokes trying to read it back. --mlock tells the Linux kernel: “Pin this 27GB strictly in physical RAM. Do not ever move it to the disk.”
Notice that if you’re not careful, you’ll see this:
warning: failed to mlock 27628376064-byte buffer
(after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
The flag is fine; the kernel-side memlock limit isn’t set high enough to pin a 27 GB buffer. This is not an LLM-shaped problem at all — it’s a ulimit default — and it’s the kind of footgun the blackbox tools paper over by simply not asking for the optimization in the first place.
Consider that for a moment, that many tools by default will just have no problem putting your model into swap if it decided that’s the best option. You can imagine how much this can hurt performance…
--no-kv-offload tells the engine not to look for a GPU for the KV cache. There isn’t one to find, but the flag short-circuits the check.
The KV (Key-Value) cache is the AI’s short-term memory — it stores the context of the current conversation so the model doesn’t have to re-read the entire prompt for every new token.
Because the KV cache is constantly being read from and written to, AI engines usually try to “offload” it to a GPU, which has much faster memory than we do.
Since this specific setup is highly optimized to run purely on a CPU, letting the engine search the hardware buses for a GPU that doesn’t exist is a waste of time and could throw an error. This flag explicitly short-circuits that check, telling the engine to just keep the short-term memory in the system RAM alongside the weights.
Graph layout.
I’ve tried my best to keep this easy to understand, but this part is just plain hard to make explain in a single blog post.
Now onto dark arts. A common frustration in bleeding-edge AI software is that the engine is being developed so fast that the developers don’t have time to write official documentation. If you want to know how to optimize the engine, you have to dig through the raw code or read the Github Pull Request (PR) comments between the developers.
-sm graph -smgs -sas -mea 256 --split-mode-f32
These flags govern how the computational graph is allocated across memory regions. The full documentation ultimatley lives in the code, even if it has some documentation.
The flag -sm graph tells the engine to use Split Mode in the Graph mode (often known in the industry as Tensor Parallelism). This is entirely about how you divide the massive math workload across multiple processors or memory regions (like multiple CPU sockets or GPUs).
Layer Split (The Default/Fallback): The engine slices the model horizontally. Processor A calculates Layers 1–10, then sends the data over the system bus to Processor B, which calculates Layers 11–20. While Processor A is working, Processor B is sitting idle.
Graph Split (The Goal): The engine slices the computational graph vertically. Processor A and Processor B calculate different halves of Layer 1 at the exact same time, combine their answers, and move to Layer 2 together. This keeps all hardware running at 100% simultaneously, drastically improving generation speed.
On this run, the engine declines:
=======================================================
Split mode 'graph' is not supported for Gemma4 external MTP
=> changing split mode to 'layer'
=======================================================
Because MTP creates a much more complicated web of math at the very end of the network, this inference engine simply hasn’t gotten support yet to safely “graph split” (vertically slice) an MTP architecture yet. When the engine boots up, it detects the MTP layers, realizes -sm graph will break the math, and safely downgrades to the slower, sequential layer split so the model can still run.
I’ve included it because it will likely be very helpful in the future, so you should try your luck if you’re working on a newer version.
While -sm graph was disabled, these other flags still apply to how the engine manages memory:
-sas (Split Across Sockets): Explicitly tells the engine how to divide the workload across different physical CPU sockets (NUMA nodes) on a server motherboard. You may note we only have one CPU, but we could get more later, it’s a nice optimization, just bench it to be safe if you do this, since older boards may break current day assumptions.
--split-mode-f32: When data is split across processors, it has to be stitched back together. This flag forces those intermediate connection points to use 32-bit floating-point precision (higher quality math). It prevents the AI from losing intelligence or hallucinating due to rounding errors during the split.
And don’t worry if you see this:
Oops: tensor with strange name rope_freqs.weight
It has a strange name. Strange names will not stop us here. :D
Attention.
Look. ikawrakow, creator of ik_llama.cpp is beyond the word “craked”.
Kawrakow wrote custom CPU kernels to handle Flash Attention, bypassing the need for a GPU during heavy context processing.
This let’s us do something that normally you only do on a GPU.
--flash-attn on --mla-use 3
Flash Attention fuses the attention softmax with its matmuls to avoid materializing the full attention matrix. Duh, anyone knows this, but I’ll try to explain it.
To generate text, an AI has to calculate how every single word in your prompt relates to every other word. Mathematically, this creates a grid of size N×N (where N is the number of tokens).
If you give the AI a short sentence, that grid is small. But if you feed it a 100,000-word document, that matrix explodes into 10 billion cells. Normally, the processor calculates this massive matrix and “materializes” it — meaning it physically writes the entire giant grid out to the main system RAM, only to immediately read it back for the next step.
Flash Attention applies the Kernel Fusion trick, but to the attention mechanism. It calculates the attention scores in small chunks and fuses the math (the softmax) so that the giant N×N matrix is never actually written to RAM. It is calculated and consumed entirely inside the processor’s ultra-fast local cache.
Flash Attention was originally invented strictly for GPUs because it relies on how GPU hardware handles memory blocks. Successfully porting this highly complex, hardware-specific optimization to work on standard CPUs is a massive software engineering achievement. Well done ikawrakow.
--mla-use 3 enables Multi-Head Latent Attention. Earlier, we discussed the KV Cache (the AI’s short-term memory of the conversation that prevents it from having to re-read the whole prompt for every word).
In standard architectures, storing the raw Key and Value data for every single token eats up RAM incredibly fast. Multi-Head Latent Attention (MLA) is a breakthrough architecture that heavily compresses this short-term memory. Instead of saving raw data for every token, it compresses the Keys and Values into a much smaller, dense mathematical representation (a “latent” space).
This drastically reduces the memory footprint of the KV cache, allowing the model to remember massive conversations without running out of system RAM. The flag --mla-use 3 simply tells the engine to activate a specific tier or kernel implementation of this compression.
But all of this is just experimental stuff right, like the split mode graph? Nah. The logs confirm both took:
flash_attn = 1
fused_moe = 1
fused_up_gate = 1
The memory accounting from the logs:
------------------- Layer sizes:
Layer 0: 825.98, 2048.00, 2873.98 77.00 MiB
...
Layer 29: 840.59, 1024.00, 1864.59 77.00 MiB
Layer 30: 748.00, 435.00, 1183.00 MiB (output layer)
--------------------------------------------------------------------------
Total : 24852.46, 56755.00, 81607.46 MiB
Memory required for model tensors + cache: 82355 MiB
An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.
That a working configuration requires 25 flags, half of which are undocumented and a quarter of which fail silently, is a reasonable working definition of the usability moat described in the first post.
The engine loads a 25B-parameter MoE, runs speculative decoding against an MTP drafter, and generates text at reading speed on hardware that was old when the architecture in question hadn’t been invented yet.
When we started this series a week ago, the state of local open-weights AI looked grim. We began by pulling back the curtain on the industry’s favorite marketing spin: the idea that dropping a massive, uncalibrated weights file onto a repository constitutes “open source.” We looked at the massive usability moat built out of missing documentation, silent defaults, and black-box wrappers that hide performance-killing decisions under the guise of user-friendliness.
In the second post, we rolled up our sleeves and waded into the muck. We hunted down obscure, unmerged pull requests, compiled specialized forks (ik_llama.cpp), flipped the standard logic of quantization on its head to build highly precise speculative decoding drafters, and wrote custom scripts to scrub infrastructure data leaks out of our GGUF metadata.
Finally, in this post, we put our money where our mouth is. We dragged a 2016 enterprise relic out of the closet — NAY, out of the grave, a single Intel Xeon running on agonizingly slow DDR3 RAM with absolutely no GPU to speak of — and forced it to run a cutting-edge, 26-billion-parameter Mixture-of-Experts architecture at reading speed. We did without throwing exotic hardware at the problem. Instead we treated the deployment pipeline as a serious thing, and mapped the architecture directly to physical hardware, tuning memory allocation, and unlocking the absolute limits of CPU cache optimization.
The lesson here is simple: The bottleneck to running state-of-the-art AI locally isn’t just in the silicon. It’s the need to understand how the inferrence engine actually works. Deeply.
While a cluster of data-center graphics cards, a corporate API token, or a massive budget are all extremly useful for specific workloads, for the ones that the open models cover, you just need refurbed hardware and to refuse to let black-box tools hold the steering wheel. Armed with the right fork, calibrated quants, and an understanding of the memory architecture under your hood, the usability moat vanishes.
The bleeding edge of Open Weight AI isn’t locked behind a paywall or a model proivider. If you’re already running a homelab, It’s sitting right there on the command line of a ten-year-old server.
Welcome to the other side of the moat. Now go download the quants and get your hands dirty.
Thanks for reading :D
- There is no "moat" (lasting, easy-to-defend technological edge) in AI model businesses. There are just short-term advantages.
- An AI business is a capital-intensive business, just like old factories. Data centers are expensive, models are energy-hungry, and the hardware inside must be replaced every 3–4 years.
- Smaller, specialized models eat margins from below. Transcription, voice, or image detection do not need large models.
There is no reason to expect high margins like you can in traditional software business. Benefits of AI go mostly to consumers.
edit: There is potential for economies of scale. Few megacorps can strive for cost advantage when they achieve scale (Microsoft, Google, Amazon and Meta)
> The E5-2620 v4 is great. Have been using it for 10 years now.
10 years? Damn, that is a long time. I always assumed that heat-induced damage will kill a CPU after a certain amount of time (5-7 years). Am I wrong here? I assume yes. Or are CPUs must stronger/tougher than the bad old days?Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly I need to redo the themes.
> You say it runs "at reading speed". Have you benchmarked it?
At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:
llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128
Gives:
llama_print_timings: load time = 83911.65 ms
llama_print_timings: sample time = 26.99 ms / 128 runs ( 0.21 ms per token, 4742.15 tokens per second)
llama_print_timings: prompt eval time = 343.41 ms / 7 tokens ( 49.06 ms per token, 20.38 tokens per second)
llama_print_timings: eval time = 10639.36 ms / 127 runs ( 83.77 ms per token, 11.94 tokens per second)
llama_print_timings: total time = 11114.98 ms / 134 tokens
So 11.94 tokens per second while it's also playing binary cache and CI builder.When I do it properly, I'll add it to the blog as well!
As long as performance is useable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.
Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.
I don’t know OpenAI’s infra, but to the extent they are buying GPUs and building data centers with their own money, that sounds like a bad move.
Satya has mismanaged the AI transition in many ways, but one thing he got right is that models are commodities, and the value is in applications that apply them to create user benefit. I agree that any company trying to build a moat with a model is not long for this world.
It makes sense to show some ads and get some money at low volume (like a faraway reader wanting to read a story in your local newspaper) but taking money from regular users directly will pay much more.
Newspapers are happy to cannibalize 99% of their ad revenue with a paywall if that 1% subscribes because that’s how much more money you make from someone paying $10-$20/month vs ads.
But yeah, if people use it as a buying recommendation engine, that’s where the money is on ads/referrals but a lot of AI use has little/no connection to buying intent touchpoints.
I just picked up the DDR3 board, an Aliexpress "XD3" so I could reuse some DDR3 ram on a better CPU. Quad channel 1866MT/s is not bad!
I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice, the memory bandwidth is higher than of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.
Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
Mainboard Product Name: P8Z77 WS
GPU 05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
Memory: 32GBThis works.
Would love to see the benchmarks if someone actually pulls something like that off.
That would be the dream... no fucking Electron! No lockdown modules.
Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
RAM is really slow at silicon speeds. Very little is reachable in one clock cycle, unless the clock cycle is abysmally slow.
So you'd change the invocation slightly here, but a lot of things you can potentially reuse.
That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.
Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
I think you've misunderstood what good enough means in the context - which is a model capable of completing the tasks assigned to it without having the breadth of full generalization. Your analogy breaks down because of this - we did get 'good enough' spec profiles for different hardware. That thing you're wearing on your wrist won't have the same specifications as the box you use to play games.
That's correct. The problem is they have smart people, tons of money, and several years to figure that out, and the best thing they can come up is a coding agent.
People -- WANT -- this technology on their home devices and (apparently?) the providers of this tech don't seem to be running a profit so they probably don't want the maintenance tail on their side either.
I think it's a bit different. Inevitable that this becomes a household-run thing? Not likely.
A new game is a totally new world with everything created from scratch. A creation. A model, on the other hand, is a reinterpretation machine for hundreds of years of human creations, but not a creation in itself, more like a discovery.
You would think that by now we would have a much better Bitcoin that's taking over the payment networks of the world but what we actually got is a shitload of shitcoin.
Even then, if a commodity chip isn't pushed full tilt at all times, and assuming that the venting and dissipation are adequate, a commodity chip can last a long time.
Given knowable runtime hardware usage patterns (huge bursts of memory bandwidth saturation) and a single limited core/thread-shared resource (memory bandwidth), one could optimize for the constraint ahead of runtime.
Because most of the performance optimization levers you have available to pull are (a) trade compute for memory bandwidth (e.g. compression), (b) preload when memory bandwidth is available, (c) optimize the choice of what's in cache when, (d) align to cache size / memory boundaries.
Or tl;dr, try to approximate GPU ISAs at the CPU compiler level. (Which why would anyone but hobbyists, because everyone else just buys pallets of Nvidia/AMD or designs their own ML chips?)
e.g. one time I tried making a collaborative drawing application but I messed up the logic, and the brush strokes would just get temporarily mirrored between the client and server, so you'd see it getting drawn over and over again in a loop.
The drawing wasn't stored anywhere, it existed only in the network packets between client and server. Accidental GNU.
http://www.gnuterrypratchett.com/
So I started working on a tool that adds random errors back into my programs. To reintroduce the possibility of such happy little accidents.
Example. If you have a neuron with 16 inputs each 8 bit wide and with a 4 bit weight per input, you will have 16 specialized multipliers each scaling its input by the corresponding weight and then the 16 scaled inputs feed into an adder tree and finally an activation function.
I run my word processing software on my apple 2 (a total joke of a computer) instead of running it on the WANG.
I run my book keeping software on visicalc instead of the IBM.
I run my simulation software on my IBM PC (I even paid for the 8087!) instead of the VAX.
Moore's law has, at least so far, allowed the pioneers with toy computers to grow their toys big enough to solve "big boy" problems after some time has allowed the toy computers to be faster and the pioneers have scaled their crappy home-grown solution to solve their 60% of the problem that was originally solved by some enormous complex system.
Eventually the toy infrastructure gets expensive and solves 90-120% of the "big iron" problem space, but it also grows to cost as much as the big iron solution, but then a new generation of toy software and toy systems emerges to disrupt the "big iron" systems.
See also http://www.catb.org/jargon/html/W/wheel-of-reincarnation.htm...
Make the local AI competent enough to do good image generation and editing, realtime voice and music generation, handle agentic tasks with a framework like Hermes, and you can take your AI places to do tasks in contexts that are inaccessible to or inappropriate for cloud.
Frontier big platform models will be the best, but there's a level of "good enough" for local uses that we're already seeing flourish, and "good enough" for the average joe is almost here.
The primary feature of "AI" is to process information and reason with a natural language interface at speed, the primary feature of AI bigboys is to provide the machinery that runs the "models".
See the difference?
It does seem like the structural characteristics we’ve observed so far suggest there is a kind of flywheel from short-term to long-term advantage due to the capital requirements at various levels.
If you’re Nvidia, making the best GPUs today, the expanding wavefront of demand is consuming them with volume and margins to give you a huge edge in building out the best next generation of GPUs. Similar to how the mobile wave gave TSMC sustained advantage for about a decade now.
I’m guessing this is also what we’re seeing as Anthropic and OpenAI swap spots in the token-vendor market.
[1] https://www.intel.com/content/www/us/en/products/sku/92986/i...
You likely need to replace the flow-through server chassis system with an active "normal" cooler to achieve a bit of silence.
85W might be about right. My old server CPU is in the same ballpark and compiling kernels it reached about 90w in power usage. If you want to keep it running: idle is not very low power unless you have one of the "low power" L versions, keep that in mind.
A GPU typically processes close to 1000 tokens/s during eval.
I have seen the results of some early attempts. It fails in such hilarious ways that all these companies are scared of productizing it. But once someone does it, the taboo is broken and everyone else will follow suit immediately.
If you’ve got something consuming 100 watts average over your 24 hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription.
Just on electricity, this assumes your hardware never fails and you never incur any additional costs.
There’s a big reason why newer more efficient hardware is in demand. Something that’s 10+ years old has drastically worse performance per watt.
Obviously I am not saying to throw away your old hardware as a rule but there is a point where some of this old stuff just isn’t even worth running.
LLMs may or may not be able to cover their costs with it. We'll see - I suspect product placement as recommendations will become a thing as it won't take as much GPU to give a "recommendation" on "the best widget for X". I firmly expect it to become enshittified the same way google and amazon search has.
And that's if LLMs don't become commodified.
~/ik_llama.cpp[main]$ build/bin/llama-cli --model ~/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune -cnv --color --jinja --special -smgs -sas -mea 256 --temp 0.7 -t 6 --parallel 6 --cpu-moe --merge-up-gate-experts --flash-attn on --mla-use 3 --mlock --run-time-repack --no-kv-offload . works pretty fast, at about 15 t/s:
llama_print_timings: sample time = 45.28 ms / 404 runs ( 0.11 ms per token, 8921.67 tokens per second) llama_print_timings: prompt eval time = 949.42 ms / 51 tokens ( 18.62 ms per token, 53.72 tokens per second) llama_print_timings: eval time = 24067.08 ms / 400 runs ( 60.17 ms per token, 16.62 tokens per second) llama_print_timings: total time = 242192.55 ms / 451 tokens
so i wonder why the params used by the quantified qwen model use way less memory than the ones of gemma.
Are you telling me I should go for it? :)
I do have a dual DGX Spark cluster running MiniMax M2.7 already so I am all for on-prem. But will be interesting how this old machine will perform!
The ‘best’ things are; - fuzzy pattern matching algorithms for traffic analysis, human and other image target recognition.
- targeting algorithms that identify ‘suspicious’ individuals in large volumes of metadata.
- fraud analysis
- antagonistic image and video generation, both for fooling other fraud analysis, but also for propaganda, screwing with other actors, etc.
- directed high speed content generation (text, pictures, video) to spam the ‘algorithm’ and allow near realtime identification of additional buttons to push for given target audiences.
- massive marketing/ad manipulation.
Those budget line items (and the suppliers) really want to stay off the radar however, as it makes their life harder.
> a model capable of completing the tasks assigned to it
The thing is, the "task assigned to it" is changing with improved capabilities. If everyone around you in 2036 is using general AI to do amazing stuff, you will probably have little interest in vibe coding slop like it's 2026.
This is also why the money being poured into datacenters isn't going to result in as much development as you think. It's about leveraging other people's money to lockdown more future hardware. This is going to end exactly like fiber build out in the 2000s. Eventually that fiber got used but the folks who originally paid for it got hosed.
If a vendor can SaaS a solution, then enterprise is generally happy (they don't want to have to hire folks for maintenance), and that completely locks out any ability to run locally.
Between enterprise's ambivalence and the obvious financial incentive to vendors, you get SaaS-only products.
Hosting a blog 24x7 on a laptop is trivial, except for hyperscaling to the front page of HN and Reddit.
But my downtimes are a bit self-inflicted: changing ISPs which I can personally workaround but harder for a blog where one expects uptime.
I don't run it anymore but my old server was a dual xeon (with two of those coolers crammed in) and I rarely heard a peep out of it.
Except you can overclock v3 :)
Well, you can use it for lots of other things as well.
Compared to the cloud you can probably save up to buy a new server every month. And don't underestimate the gains of having something to experiment on and play with.
Bearings in fans, caps etc. are also stuff that you need to keep an eye on.
I just replaced a i5-660 thats been powered on since 2010 24/7, heatpaste was fucked so it crashed during heavy loads :)
2010s Javascript, putting down the controller: Ha, no one will ever surpass my high score for wasting programmer time with dependency churn...
2026 Open Source ML: Hold my beer.
It's probably too small for the timings to be taken seriously.
Claude subscription pricing is a broken way to consider footprint.
Only if you give in to fads and FOMO.
The core tasks people need change at a much smaller pace.
So there is a bigger incentive to run locally something that's gonna get you $20 or $100 worth of bills to OpenAI than to mirror something that is actually free.
Example: In the past there was a whole market for sound cards, if you wanted your computer to have any "multimedia" capabilities you needed to get a sound blaster but now everybody assumes a computer will produce sound, and it's basically for free as all chips have it. Now sound interfaces are still a thing but only for audiophiles who are esoteric enough like me to believe that it's worth to have that extra hi-fi quality.
What I think it could happen, is that eventually AI will be part of all the chips, just like soundcards. And there will be people who will buy specialized AI from companies that perhaps are not OpenAI or Anthropic but second-generation sleepers who watched the carnage in the market and decided to enter when it was reasonable.
This could be Apple, or Nvidia or something new. They're just waiting for the others to do the research and introduce the taste for it to the masses, just like sound blaster made us fall in love with high fidelity sound in our computers.
You can (sometimes) break even if you have a workstation GPU.
Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.
The maximum parallel width becomes equal to the number of layers in the model. Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).
In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.
Now it's compliant with the law.
The test prompt above was "Why is the sky blue?", so there's the seven tokens. I meant to highlight that because I'd expect processing of a thousand-token input to be faster per token than presented.