A 10 year old Xeon is all you need

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and mainstream tools not prioritizing this, and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.

We’re not there yet, but the obvious endgame of the present bubble insanity is open models running on local hardware and devices are “good enough” for most use cases. That will completely implode what’s going on at the moment in tech.

I bought one AMD MI50 32GB back then when they were sold rather cheap (around $150-$170). it can easily generate over 70 tokens per second for gemma 4 26B moe model (q4).

I have no doubt that we will have another wave of cheap retired server gpus just like before. And that is the time when everyone will have their own models at their home.

Or we can just buy the newest medusa halo mini pc. they will be pretty decent, too, albeit pricey.

Nice post and technically impressive work. I agree we need to understand the build pipeline and be able to do things locally. However, depending on your electricity cost, it might not make sense financially. These old servers are not energy efficient at all (I'm guessing that old Xeon server will easily pull 200W on load), and that model is currently at 0.1$/0.3$ per 1M tokens (with 76 tps and 262k context) in Openrouter (also, these servers are LOUD).

EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.

Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU and the image decoder in llama.cpp is super slow compared to a GPU but for some small automation tasks and general trivia questions it's decent. The speed is just enough to not have to wait for it to finish so you can read along.

Here's my setup. You may want to figure out what the best optimizations are for your specific CPU like AVX2 because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context or go even lower for Q2 and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best I assume but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context but it could hurt quality some say, altough I have noticed it too on some models.

# Building

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON

# Running

export OPENBLAS_NUM_THREADS=4

export OMP_NUM_THREADS=4

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \

llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1

Result is ~12 tokens per second, as reported by OP down in these comments here.

An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for an satisfactory interactive session.

What intrigues me the most about AI progress, is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing, but rather useless model in a beefy gaming PC that I had 6 years ago; and now, something that’s a hundred times better on my M5 laptop.

Should the market react to the memory shortage, the progress of the Apple silicon continue at the same pace, and what we’ll be able to run locally in 6 years will be very exciting. or frightening.

Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.

The E5-2620 v4 is great. Have been using it for 10 years now. Wanted to upgrade until I saw current prices. I have 64 GB ddr4. Paired it with rx 9060 xt 16 GB and games run as fast as ever. Perhaps the cpu is a slight bottleneck in DOOM The Dark Ages, but i'm at 60 fps, so no problem. Light llm on the gpu is a nobrainer, and it's cool to see that things can be tuned to run ok on the cpu. I bought 2667 v4 a month ago for 30$. I'd expect it to give a decent performance boost but I just haven't had the need for it yet, but pushing into llm like in the article I'd probably upgrade because 2667 can handle slightly faster ram.

Apparently Itanium works quite well for LLMs https://medium.com/@tglozar/running-llama-inference-on-intel...

Which makes sense I suppose.

Similar recent posting with optimizations for older Xeon:

High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440

https://news.ycombinator.com/item?id=47320244

I may have missed this in the article, but:

What was the net effect of the optimisations? How much faster did it get?

I want to share something strange. I found a typo or two in the post and this absolutely delighted me, because it implies a human wrote the words. (Or was at least heavily involved in the editing.)

Guess I am a species-ist after all ;)

I've got an old HP Z-620 workstation with dual E5-2697 v2 CPUs (24 cores total, 48 threads @ 2.7GHz) and 128GB of DDR3 RAM. The docs say it supports up to 192GB, but I wasn't able to get it to POST with all the RAM slots full.

It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.

I wish this were somehow tagged with AI, so I would know that it's not about say, general computing or cost-efficiency (e.g. using an old xeon machine from ebay instead of new, in these cost-conscious times.)

As it is, the title is click-bait for me, as 1) it says I need at least a Xeon somehow and 2) as it doesn't say what I actually need it for.

The E5 2620-v4 only supports DDR4.

llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...

ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...

When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.

How about the iMac Pro? Would that work? I was able to put 128gb in it (not as easy as the regular iMac but possible).

Old hardware is surprisingly effective. I've been considering a side hustle selling offline AI to local businesses who are privacy-sensitive. Medical, legal, places like that.

At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.

The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.

Of course, AI helped me work out a plan for this. Haha

Doesn't accepting 100% of the MTP draft tokens mean you should just be using the smaller model? Usually the acceptance rate in Qwen36 at least is around 60-70% and the "wrong" tokens are still filled in entirely by the base model, but when you just accept 100% of the draft tokens it seems kind of self defeating unless I'm wrong.

Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code.

Did some try to estimates what it would take to bake interference for a capable large language model into silicon so that one can pipeline inputs through it and produce outputs at one token per clock cycle?

Noting for reference that Gemma4 MTP work is in progress[0] on llama.cpp; similar work for Qwen3.6 landed recently and has been great thus far.

[0]: https://github.com/ggml-org/llama.cpp/pull/23398

Does this mean my 15 year old Phenom is too old? But it has 16 gb of DDR3 RAM!

Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.

I tried to run gemma 4 on this CPU and it did not go well

https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281

It is way too slow

@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks ?

so how many tokens/s do you get, pp and tg? did I miss it in the article?

Very intriguing. This might be the use for my e5-2430 V2 X2 server that's been lying around. DDR3 is (relatively) cheap now too. Could fit 192GB of RAM in it and play around for much cheaper than a new GPU.

What's the best way to apply this to slightly more modern hardware - i.e. 5800XT 32GB DDR4, 9060XT 16GB?

I have an old 192GB DDR4 Dell Precision with dual Intel Xeon Gold 6130 that I've considered spinning up. What's giving me pause is 250W at idle.

I have run llama.cpp on an i7-2600 with a 1050. It's too slow for everyday usage but it's not too slow to make it obvious AI is going to be everywhere and in everything. It's too easy to run.

Is this John Siracusa? It sounds like it could be something he’d say…

(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).

I have an ancient DDR3 Xeon that doesn't support any AVX (dual x5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?

And this is one of those CPUs which had dual slot motherboards so you can have double the fun (and power bill)

https://pcpartpicker.com/products/motherboard/#s=20028,20029...

Granite or sapphire rapids are very under rated for MoE inference loads. But you need a GPU for the KV cache.

Plus many boards also support CXL for RAM expansion over PCI 5!

Source: building a hybrid inference business for regulated industry workloads.

Either they have a E5-2620 V2 from 13 years ago, or they have DDR4, not DDR3. The V3 and V4 only support DDR4.

Would there be any advantage of running this as dual Xeon? The CPUs are $5 and a dual mobo is $50...

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.

This is great work.

I'd love if anyone knows how I might fare with an old Dell R710 with 2 x Xeon 5600 (12 cores total) and 96Gb of DDR3.

ive been doing the same thing. i refactored a old newtek stream machine . its my new favorite thing to do! adding old PCs to my "starcraft" fleet xD

Successfully ran Gemma4-26B-A4B on my 8yo first-gen Ryzen with a GeForce GTX 1070. It actually ran acceptably well; I was surprised. I even did some coding with it, but the wheels fell abruptly off when it tried several times to use a constant I told it doesn't exist. I only have 32 GiB of RAM in this old bucket, and these results are not worth the RAM consumption, so I put it aside. Maybe if I finish that build with more memory...

Have to point out one boring thing though: this will use a lot more electricity than newer things. So it'll work, but it'll run up your electric bill.

What kind of tokens per second did the op get I saw nothing of this written.

This and the previous one are insanely good articles. Thank you!

I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm

Makes you wonder if its possible to squeeze more tps out of a strix halo system using the 16 zen5 cores as well as the gpu.

As someone doing this for fun on a windows 11 machine (96gb ram, 5090 24gb) I wonder if I need any flags to keep the model in memory and avoid swapping to ssd?

I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.

Om am unrelated note, does anyone know a model that can help with this use case:

https://news.ycombinator.com/item?id=48301635

I also run a Qwen 3.6 moe A4B on old hardware. I set it up with

numactl --membind=1

so it is constrained to one of the memory sticks which speeds up token generation a little.

Well, lets get started. I have 4 of those machines, and they are Two dual processor. They all had 32GB of ram, so now I have two with 64GB, and two with zero. They all hand stock K5000s, now how two have two cards. I stripped the uni processors ram and video cards, and put those into the dual procs. They have 256Gb SSDs, and two 1TB disk drives. One machine has 8Gb of VRam across two cards. Dual processors are 8Cx2 and 32 Threads. They can easily play 16 videos at once. For AI, I have not found a model that I can get above 3 tokens a second. Not a one.

When you use page up and page down key when reading that blog the first line on the screen is obscured by the floating bar or what ever it is. It is not even needed for reading.

I bought one AMD MI50 32GB back then when they were sold rather cheap (around $150-$170). it can easily generate over 70 tokens per second for gemma 4 26B moe model (q4).

I have no doubt that we will have another wave of cheap retired server gpus just like before. And that is the time when everyone will have their own models at their home.

Or we can just buy the newest medusa halo mini pc. they will be pretty decent, too, albeit pricey.

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and mainstream tools not prioritizing this, and hiding all the performance levers.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.

"-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."

But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?

I also dont understand the explanation of "--cpu-moe". If an expert has ~ 4.0 GiB of Parameters, why does optimizing the sequence of experts minimize cash trashing? With 20 MiB of L3 Cash vs 4.0 GiB of Parameters, it wont cash any noticeable amount of the Parameters, will it?

As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.

Fantastic practical achievement!

I wonder if I could get similar or even better performance from similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?

The CPUs are better core wise, but that probably does not make much difference?

It has CPUs 2 × Xeon E5-2697 v2

Cores / threads 24 cores / 48 threads total

Per-CPU cores 12 cores / 24 threads

Base clock 2.70 GHz

Max turbo 3.50 GHz

It is sitting gather dust but reading spead Gemma sounds promising.

You sure you got DDR3 .. I have 2 e5 v4 rigs at home and both have ddr4 ... Unless I am wrong and 2011-3 supports ddr3 and ddr4

This seems remarkably suited to my situation,

    CPU(s): 32
      On-line CPU(s) list: 0-31
    Vendor ID: GenuineIntel  
    Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz

Also with 128G. Does 8 dimm sockets imply more actual bandwidth in practice?

This poor thing is currently a YouTube watching box.

(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

Something doesn't add up here. As someone who has only recently built a home-server from an E5-26xx v2 on DDR3 RAM (because I have a sh*tload of 32g DDR3 DIMMs), I can confidently say that the newer cores (E5-26xx v3 and v4) only run on DDR4 memory...

So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Everything else doesn't work

How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.

Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.

2620v4 is not a power slurping beast. Depending on the server board, it might not be either. Servers are often loud, but it depends.

There's a lot of budget hosting built around chips like these, and they're suprisingly power efficient.

It should be closer to 85W on load. And it's incredibly silent on even a low end cooler. I rarely get above 50° Celcius.

These servers are loud if you're trying to fit them into a 1U or 2U, which requires high speed fans to generate the necessary static pressure to push air through the case. I run a similar setup in a 4U case with slow 120mm fans and it's fine.

# Building

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON

# Running

export OPENBLAS_NUM_THREADS=4

export OMP_NUM_THREADS=4

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \

I'm setting up a Frankenstein system at the moment. It's a Chinese DDR3 X99 motherboard with a 12 core Xeon v3, 32gb 1866MT/s ram, and a 1080 Ti.

I'm shoehorning it back in the Optiplex that donated the ram, so it's not ready to go at the moment, but when I had it running on top of the motherboard box as a test I ran the (9B?) gemma4:e4b-it-q4_K_M since it can fit entirely in the 11gb vram. It flew, more than 50tk/s. A model that small isn't useful for coding, but there could be uses. I'd love to figure out a Wake-on-Use and use it as my personal ChatGPT. I'm not sure how that would work... Maybe proxy the LLM thru a Pi with a script to Wake-on-LAN the PC? It'll be a fun weekend project someday.

My always-on LLM is the dense Gemma4:31b that's not quite half in GPU on a 12gb 2060. It's really slow, but the quality is great and my use case is an automated queue so I'm not sitting there watching the output. I have another 2060 but unfortunately the PC won't POST with both installed for some reason.

Speaking of llama and local compute, there was a tweet from Georgi Gerganov (llama.cpp author) a couple of days ago saying that he is currently using Qwen3.6 27B, running locally on a Mac M2 Ultra or RTX 5090, to assist with llama.cpp development.

Result is ~12 tokens per second, as reported by OP down in these comments here.

An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for an satisfactory interactive session.

Especially if you consider those smaller models are really cheap and fast on platforms like openrouter. Often by the factor 100-500 cheaper than SOTA models, and 2-5x in TPS.

Yeah took way too long to find that result. Being able to run on slow RAM isn't surprising considering you can run a model off an SSD.

Right. You can also perform RSA encryption on pencil and paper with a scientific calculator. It works, but it's not useful throughput for serious work

I was about to ask that

Happened to me. CoPilot changing prices prompted me to cancel my CoPilot subscription and install a local coding model running entirely in VRAM. Will call Claude APIs when I get really stuck, but I should be able to handle 80% of my needs with a dumber local model.

For a long time, too. Programming languages rarely change much, techniques rarely change, so I should be able to use said model for I hope at least five years; and if at any time they optimize local models to cram even more intelligence into the same amount of VRAM, I can upgrade to that.

I like this path.

This. OpenAI and Anthropic are ultimately compute infrastructure plays and not really AI. Everyone will have models, they'll have the ability to run them. This is why the GPU shortage is in their favor.

You just described the absolute nightmare scenario for the newly minted trillion-dollar companies whose only hope is for enterprises and SMB to move all their business processes to the cloud, with employees competing at token maxxing.

I wouldn't say "completely implode", too much money was poured int it, but it's clear we're heading in that direction. You get a model that is "good enough", plus privacy, plus savings in the long term.

Paradoxically, the better results we get from general harness of coding agents, the less moat Claude and co. get. It's unbelievably how fast some open models outpaced frontier models of just a few months ago.

More likely we will have a compute device like NAS or something which will run one good model locally for all the house members just like we have one wifi router in every house. Nvidia can invest in building such a device as well as the models and make money on the hardware.

If you are willing to spend about 2000 on GPUs, we are almost there.

In my opinion, the bottleneck is the package management layer and not the model capabilities and performance.

I have been an avid Linux user for decades, and if I find it confusing and painful, something is missing.

I disagree. We are currently in a weird period where these frontier AI companies are losing tons of money even on the subscription-based AI models. It's just too compute intensive and there's no way most people are going to be buying the kind of hardware required to run $20 worth of inference every day.

Sadly - it's going to be ads. Advertising is going to get in there and enshittify the whole thing because as always, advertising income is too easy and too plentiful for any company to resist.

Right now the models are fairly agnostic, but we are a hair-breadth away from ChatGPT responding with, "the right tool for this job is a circular saw - something like the Milwaulkee M18, which happens to be on sale at Home Depot this weekend."

this is sorta like saying that being able to run your blog on your laptop will completely implode the cloud business

Curious when NVIDIA monopoly will ends. China will sure release something that can runs on commodity hardware. I wish they will soon.

I find that hard to believe. The AI companies will want to control what's possible and find new things to do that "need" their services. Otherwise it would be like Intel and Microsoft had decided in the year 2000 that computers are "good enough" now and we would have explored what's possible with that hardware ever since.

Fantastic practical achievement!

I wonder if I could get similar or even better performance from similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?

The CPUs are better core wise, but that probably does not make much difference?

It has CPUs 2 × Xeon E5-2697 v2

Cores / threads 24 cores / 48 threads total

Per-CPU cores 12 cores / 24 threads

Base clock 2.70 GHz

Max turbo 3.50 GHz

It is sitting gather dust but reading spead Gemma sounds promising.

Apparently Itanium works quite well for LLMs https://medium.com/@tglozar/running-llama-inference-on-intel...

Which makes sense I suppose.

As it is, the title is click-bait for me, as 1) it says I need at least a Xeon somehow and 2) as it doesn't say what I actually need it for.

Should the market react to the memory shortage, the progress of the Apple silicon continue at the same pace, and what we’ll be able to run locally in 6 years will be very exciting. or frightening.

Things you are not supposed to talk about:

- There is no "moat" (lasting, easy-to-defend technological edge) in AI model businesses. There are just short-term advantages.

- An AI business is a capital-intensive business, just like old factories. Data centers are expensive, models are energy-hungry, and the hardware inside must be replaced every 3–4 years.

- Smaller, specialized models eat margins from below. Transcription, voice, or image detection do not need large models.

There is no reason to expect high margins like you can in traditional software business. Benefits of AI go mostly to consumers.

edit: There is potential for economies of scale. Few megacorps can strive for cost advantage when they achieve scale (Microsoft, Google, Amazon and Meta)

What you can run locally in consumer hardware is progressing pretty well.

If you get a not-quite-the-best gaming GPU like a 5080, you can run local models that are better than the state of the art from early 2025. Depending on what you want to do, you might have to switch models. The one size fits all huge models are still a data center thing.

Its a convenience thing. You can run a whole lot of stuff locally from wikipedia to social media/email/video servers whatever. Most people with a full time job and 2 kids dont do it cause who has time and energy to patch and maintain the ever growing complexity of this stuff. These systems will keep growing complex. That also means more bugs. Age old tradeoff between freedom and convenience.

--what this means for the valuation of the AI companies

Probably nothing. Most users have no idea what an LLM is or how it runs. Anecdotally speaking, I see many LLM users default to whatever their day job provides to them. And even slightly more sophisticated users seem ok with paying for their openai or anthropic subscriptions.

Maybe we will see a small but dedicated group of open weight model users who prefer local llm, but everybody else will just consume from the big providers? The scenario might look something like OS choices today - a small, committed group of Linux users vs the vast majority of other users running Windows, MacOS, or Chrome?

This has always been true of software, particularly games. You can get a 5-6 year old game for a fraction of the price, and run it on modest hardware. But the industry wont sit on its hands for 5 years, there will be newer software that requires better hardware.

Training AI models to drive valuation reminds me of high frequency trading

I'm on a dual-E5 2667-v4 / 256 GB DDR4 Z640 with a 1080ti that I picked up all the various pieces for (aside from SSDs) for less than $500 total in the first half of 2025 (case, PSU, riser board included). I'm still kind of blown away by what you can find aftermarket / secondhand!

I also had no idea RAM and GPU costs would explode they way they did, just happened to do it the right time. I might try to grab a ~$300 3080 on Ebay and sell the 1080ti, but otherwise it's been a great upgrade -- it sucks electricity like Coca Cola, but otherwise performs fantastic as a workstation, and I'm just gonna drive it til the wheels fall off.

    > The E5-2620 v4 is great. Have been using it for 10 years now.

10 years? Damn, that is a long time. I always assumed that heat-induced damage will kill a CPU after a certain amount of time (5-7 years). Am I wrong here? I assume yes. Or are CPUs must stronger/tougher than the bad old days?

But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?

As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.

You sure you got DDR3 .. I have 2 e5 v4 rigs at home and both have ddr4 ... Unless I am wrong and 2011-3 supports ddr3 and ddr4

This seems remarkably suited to my situation,

    CPU(s): 32
      On-line CPU(s) list: 0-31
    Vendor ID: GenuineIntel  
    Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz

Also with 128G. Does 8 dimm sockets imply more actual bandwidth in practice?

This poor thing is currently a YouTube watching box.

I want to share something strange. I found a typo or two in the post and this absolutely delighted me, because it implies a human wrote the words. (Or was at least heavily involved in the editing.)

Guess I am a species-ist after all ;)

> But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?

Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.

I won't speak for cafkafk, but I have two E5 (v3/v4) systems one on DDR4 and one on DDR3. This generation of CPU all support DDR4, but a few skus do support DDR3 also. ChatGPT told me they were niche products to meet specific customer needs.

I just picked up the DDR3 board, an Aliexpress "XD3" so I could reuse some DDR3 ram on a better CPU. Quad channel 1866MT/s is not bad!

The first two generations supported DDR3 only. Haswell and Broadwell (v4) brought DDR4 support.

One thing to note: These Xeons have quad memory channels, that usually means double the bandwidth of an equivalent desktop CPU, if you populate all the slots.

I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice, the memory bandwidth is higher than of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.

I hope LLMs don’t get trained with this reply and start adding typos for making it look like it came from a human :)

I self host on old HP Z-840 with 2x3.6 GHz Xeons 24 total cores, and 512 GB RAM. Cost me peanuts used and works like a charm for many years already

I may have missed this in the article, but:

What was the net effect of the optimisations? How much faster did it get?

Similar recent posting with optimizations for older Xeon:

High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440

https://news.ycombinator.com/item?id=47320244

If you are willing to spend about 2000 on GPUs, we are almost there.

In my opinion, the bottleneck is the package management layer and not the model capabilities and performance.

I have been an avid Linux user for decades, and if I find it confusing and painful, something is missing.

(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

> (purple on black is really hard to read)

Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly I need to redo the themes.

> You say it runs "at reading speed". Have you benchmarked it?

At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:

llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128

Gives:

  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens

So 11.94 tokens per second while it's also playing binary cache and CI builder.

When I do it properly, I'll add it to the blog as well!

How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.

IDK about OPs setup, but I run a pile of E5-2683v4 Xeon recycled servers for Ceph and self hosted business SaaS usage.

One node's ipmitool sensor report (and self-monitoring PSU, so grain of salt, but my UPS side monitoring tracks closely), reports 250-300w average power use. This though, mind you is for running 22 spinning disks, 2 SAS/SATA SSDs, and 4 NVME ssds, and 768GB of DDR4.

Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.

As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, a RDNA4 GPU, and a small pile of NVME drives has a power floor of roughly 250W. Some CPU centric workloads you'll definitely lose out on on the older gens of machines, but they are by no means impractical.

Maybe for a desktop usecase they are absolutely suboptimal nowadays, but for a lot of realworld usecases I would say they're still relevant.

---

Like the author posts for the LLM usecase, I think optimizing the hardware choice to the application and not leaving levers unpulled is a big key, especially considering how wide a variety of bandwidth/power draw/peak frequency/corecount SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.

How many kWh to fabricate a brand new machine better suited to the task?

As long as performance is useable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.

Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.

So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Everything else doesn't work

There are some OEM-only v3/v4 parts with dual memory controllers (because of a RAM supply crunch at the time, funnily enough), but the E5-2620 v4 is not one of them. The classic example is the very popular 12-core E5-2678 v3.

This is not true. A few well known brands made both DDR3 and DDR4 servers that support v3 & v4 chips. Ask me how I know :-)

It looks like Supermicro had some DDR3 Xeon v3/v4 boards, and the first thing that came to mind was a Shenzen workstation/gaming board using recycled parts... haven't searched on that but it's bound to exist.

Yeah, the Intel reference page only lists DDR4, not DDR3:

https://www.intel.com/content/www/us/en/products/sku/92986/i...

> So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.

Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).

I like this path.

> Will call Claude APIs when I get really stuck, but I should be able to handle 80% of my needs with a dumber local model.

I experiment with all of the local models I can fit into 32GB of VRAM and I have subscriptions to multiple SOTA providers.

The difference between them is very large, unfortunately. The local models can handle small tasks and refactoring mostly okay, but doing anything challenging with them becomes a waste of time. Unfortunately the waste isn’t immediately obvious because they will come back with something that looks like it works, but then on closer examination I need to throw it out and reset them in a usable direction.

And like Google and Meta, these companies are going to morph into advertising giants. Advertising is an economic black hole and it eats everything that comes close.

How does that view align with Anthropic leasing data centers from others?

I don’t know OpenAI’s infra, but to the extent they are buying GPUs and building data centers with their own money, that sounds like a bad move.

Satya has mismanaged the AI transition in many ways, but one thing he got right is that models are commodities, and the value is in applications that apply them to create user benefit. I agree that any company trying to build a moat with a model is not long for this world.

Do you think there will still be an incentive to release weights in that scenario? Everyone will have models only if there continue to be companies releasing weights.

Maybe. But if we can all run our own model locally in 2 years on commodity hardware OpenAI and Anthropic will start to look like WeWork during the pandemic

I keep intending to find time to try them. What are you seeing the best results with?