I don’t know how any AI company can be worth trillions when you can fit a model only 12-18 months behind the frontier on your dang phone. Thought will be too cheap to meter in 10 years.
> With those guardrails — a calculator for arithmetic, a logic solver for formal puzzles, a per-requirement verifier for structural constraints, and a handful of regex post-passes — the projected score climbs to ~8.2.
Surgical guardrails? Tools, those are just tools.
In any case, GPT-3.5 isn't a good benchmark for most serious uses and was widely considered pretty stupid, though I understand that isn't the point of the article.
It's not caught up if you're using Claude as your pair programmer instead of the model you're touting. Gemma 4 may be equivalent to GPT-3.5 Turbo, but GPT-3.5 isn't SOTA anymore. Opus 4.5 and 4.6 are in a different league.
But, Gemma really is very impressive. The premise that people are paying for GPT-3.5 or using it for serious work is weird, though? GPT-3.5 was bad enough to convince a lot of folks they didn't need to worry about AI. Good enough to be a chatbot for some category of people, but not good enough to actually write code that worked, or prose that could pass for human (that's still a challenge for current SOTA models, as this article written by Claude proves, but code is mostly solved by frontier models).
Tiny models are what I find most exciting about AI, though. Gemma 2B isn't Good Enough for anything beyond chatting, AFAIC, and even then it's not very smart. But Gemma 31B or the MoE 26BA4B probably are Good Enough. And those run on modest hardware, too, relatively speaking. A 32GB GPU, even an old one, can run either at 4-bit quantization, and they're OK, competitive with frontier models of 18 months ago. They can write code in popular languages, and the code works. They can use tools. They can find bugs. Their prose is good, though still obviously AI slop: too wordy, too flowery. But you could build real and good software using nothing but Gemma 4 31B, if you're already a good programmer who knows when the LLM is going off on a bizarre tangent. For things where correctness can be proven with tools, a model at the level of Gemma 4 31B can do the job, if slower and with a lot more hand-holding than Opus 4.6 needs.
The Prism Bonsai 1-bit 8B model is crazy, too. Less than 2GB on disk, shockingly smart for a tiny model (but also not Good Enough by my definition above; it's about as weak as Gemma 2B in my limited testing), and plenty fast on modest hardware.
Small models are getting really interesting. When the AI bubble pops (or whatever happens to normalize things, so normal people can buy RAM and GPUs again) we'll be able to do a lot with local models.
we wanted to know how google's gemma 4 e2b-it — 2 billion parameters, bfloat16, apache 2.0 — stacks up against gpt-3.5 turbo. not in vibes. on the same test. mt-bench: 80 questions, 160 turns, graded 1-10 — what the field used to grade gpt-3.5 turbo, gpt-4, and every major model of the last three years. we ran gemma through all of it on a cpu. 169-line python wrapper. no fine-tuning, no chain-of-thought, no tool use.
gpt-3.5 turbo scored 7.94. gemma scored ~8.0. 87x fewer parameters, on a cpu — the kind already in your laptop.
but the score isn't what we want to talk about. what's interesting is what we found when we read the tape.
we graded all 160 turns by hand. (when we used ai graders on the coding questions, they scored responses as gpt-4o-level.) the failures aren't random. they're specific, nameable patterns at concrete moments in generation. seven classes.
cleanest example: benjamin buys 5 books at $20, 3 at $30, 2 at $45. total is $280. the model writes "$245" first, then shows its work — 100 + 90 + 90 = 280 — and self-corrects. the math was right. the output token fired before the computation finished. we saw this on three separate math questions — not a fluke, a pattern.
the fix: we gave it a calculator. model writes a python expression, subprocess evaluates it, result comes back deterministic. ~80 lines. arithmetic errors gone. six of seven classes follow the same shape — capability is there, commit flinches, classical tool catches the flinch. z3 for logic, regex for structural drift, ~60 lines each. projected score with guardrails: ~8.2. the seventh is a genuine knowledge gap we documented as a limitation.
one model, one benchmark, one weekend. but it points at something underexplored.
this model is natively multimodal — text, images, audio in one set of weights. quantized to Q4_K_M it's 1.3GB. google co-optimized it with arm and qualcomm for mobile silicon. what runs it now:
phones: iphone 14 pro+ (A16), mid-range android 2023+ with 6GB+ ram
tablets: ipads m-series, galaxy tab s8+, pixel tablet — anything 6GB+
single-board: raspberry pi
laptops: anything from the last 5-7 years, 8GB+ ram
edge/cloud: cloudflare containers, $5/month — scales to zero, wakes on request
google says e2b is the foundation for gemini nano 4, already on 140 million android devices. the same model that matched gpt-3.5 turbo. on phones in people's pockets. think about what that means: a pi in a conference room listening to meetings, extracting action items with sentiment, saving notes locally — no cloud, no data leaving the building. an old thinkpad routing emails. a mini-pc running overnight batch jobs on docs that can't leave the network. a phone doing translation offline. google designed e2b for edge from the start — per-layer embeddings, hybrid sliding-window/global attention to keep memory low. if a model designed for phones scores higher than turbo on the field's standard benchmark, cpu-first model design is a real direction, not a compromise.
the gpu isn't the enemy. it's a premium tool. what we're questioning is whether it should be the default — because what we observed looks more like a software engineering problem than a compute problem. cs already has years of tools that map onto these failure modes. the models may have just gotten good enough to use them. the article has everything: every score, every error class with tape examples, every fix, the full benchmark harness with all 80 questions, and the complete telegram bot code. run it yourself, swap in a different model, or just talk to the live bot — raw model, no fixes, warts and all.
we don't know how far this extends beyond mt-bench or whether the "correct reasoning, wrong commit" pattern has a name. we're sharing because we think more people should be looking at it. everything is open. the code is in the article. tear it apart.
We may beat you to it and we will share if we do lol
arithmetic (Q119): benjamin buys 5 books at $20, 3 at $30, 2 at $45. model writes "$245" first line then self-corrects to $280. fix: model writes a python expression, subprocess evals it, answer comes back deterministic.
```python
code_response = generate_response(messages, temperature=0.2)
code = _extract_python_code(code_response)
ok, out = _run_python_sandboxed(code, timeout=8)
if ok:
    return _wrap_computed_answer(user_message, out)
return None  # fallback to raw generation
```
logic (Q104): "david has three sisters, each has one brother." model writes "that brother is david" in its reasoning then ships "one brother." correct answer: zero. fix: model writes Z3 constraints or python enumeration, solver returns the deterministic answer.
```python
messages = [
    {"role": "system", "content": _logic_system_prompt()},
    {"role": "user", "content": f"Puzzle: {user_message}"},
]
code_response = generate_response(messages, max_tokens=512, temperature=0.2)
code = _extract_python_code(code_response)
ok, out = _run_python_sandboxed(code)
if ok:
    return _wrap_computed_answer(user_message, out)
return None
```
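When Z3 is unavailable, the fallback is plain Python enumeration. A toy sketch of the kind of code the model might emit for Q104 (the enumeration bound and structure are illustrative, not from the article):

```python
def david_brother_count():
    # enumerate the number of boys in the family (David is one of them)
    for boys in range(1, 10):
        # constraint: David's three sisters each have exactly one brother,
        # so the family contains exactly one boy
        if boys == 1:
            return boys - 1  # David's brothers = boys other than himself
    return None

# deterministic answer the solver path ships: 0
```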
persona break (Q93): doctor roleplay, patient mentions pregnancy. model drops character: "I am an AI, not a licensed medical professional." fix: regex scan, regen once with stronger persona anchor.
```python
_IDENTITY_LEAK_PHRASES = [
    "don't have a body", "not a person", "not human",
    "as a language model", "as an ai", "i'm a program",
]

if any(phrase in response.lower() for phrase in _IDENTITY_LEAK_PHRASES):
    messages[-1]["content"][0]["text"] += (
        "\nCRITICAL: Stay in character. Never reference your nature."
    )
    response = generate_response(messages, *params)
```
self-correction artifacts (Q111, Q114, Q119): model writes "Wait, let me recheck" or "Corrected Answer:" inline. right answer, messy output. fix: regex for correction markers, strip the draft, ship the clean tail.
```python
import re

CORRECTION_MARKERS = [
    r"Wait,? let me",
    r"Corrected [Aa]nswer:",
    r"Actually,? (?:the|let me)",
]

def strip_corrections(response):
    # drop the abandoned draft and ship only the clean tail
    for marker in CORRECTION_MARKERS:
        match = re.search(marker, response)
        if match:
            return response[match.end():].strip()
    return response
```
constraint drift (Q87): "four-word sentences" nailed 5/17 then drifted. Q99, "<10 lines" shipped 20-line poems twice. fix: draft, verify each constraint against the original prompt, refine only the failures. three passes.
```python
def execute_rewrite_with_verify(user_message):
    draft = generate_response(draft_msgs)      # pass 1: draft
    verdict = generate_response(verify_msgs)   # pass 2: check each requirement
    if "PASS" in verdict:
        return draft
    refined = generate_response(refine_msgs)   # pass 3: fix only failures
    return refined
```
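For structural constraints like Q87 and Q99, the per-requirement verify pass can be fully deterministic rather than another model call. A sketch with illustrative helper names (not from the article's code):

```python
def all_sentences_four_words(text):
    # Q87-style check: every sentence must be exactly four words
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in normalized.split(".") if s.strip()]
    return all(len(s.split()) == 4 for s in sentences)

def under_n_lines(text, n=10):
    # Q99-style check: fewer than n non-empty lines
    return len([ln for ln in text.splitlines() if ln.strip()]) < n
```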
every one of these maps to a specific question in the tape. the full production code with all implementations is in the article. everything is open: seqpu.com/CPUsArentDead
Unfortunately still need to copy paste the code into a file+terminal command. Which is annoying but works.
This article is very clearly shitty LLM output. Abstract noun and verb combos are the tipoff.
It's actually quite horrible, it repeats lines from paragraph to paragraph.
Edit: the author's comment below is dead, so I'll reply here: The tape and general effort is great, it's the overused LLM-style intro above that that grates. LLM writing is now like the Bootstrap of old, it's so overused that it's tedious to read.
(Also this comment is ai generated so I’m not sure who I’m even asking.)
With the same hardware I now get genuine utility out of models like Qwen 3.5 for categorizing and extracting unstructured data sources. I don't use small local models for coding since frontier models are so much stronger, but if I had to go back to small models for coding too they would be more useful than anything commercially available as recently as 4 years ago.
Gemma 2B scored ~8.0 on MT-Bench. GPT-3.5 Turbo scored 7.94. An 87-times-smaller model on a laptop CPU, no GPU anywhere in the stack. We published the full tape — every question, every turn, every score — so anyone can verify it. We found seven failure classes. Not hallucinations. Specific patterns: arithmetic where it computed correctly but committed the wrong number first, logic puzzles where it proved the right answer then shipped the wrong one, constraints it drifted on, personas it broke, qualifiers it ignored. Six surgical fixes, about 60 lines of Python each. One known limitation documented. Score climbed to ~8.2. The hardware was enough all along. What the field has been calling a compute problem is a software engineering problem — and any motivated developer can close that gap in a weekend. The tape, the code, and the fixes are all open. A bot running the raw model — no fixes applied, warts and all — is live on Telegram right now. Talk to it. Push it. Break it. Then read about what you just experienced.
The SeqPU Team
PUBLISHED APRIL 2026 · FIELD REPORT · SeqPU.com
Run it yourself for free, forever:
```shell
pip install torch transformers accelerate
python chat.py  # full script below
```
Works offline after the first download. No account. No API key. Your laptop. Your data. Nobody else involved.
Want it globally accessible? Cloudflare Containers, $5/month. Scales to zero. Sleeps when idle. Wakes on request. Details below.
A bot running the raw model — no guardrails, no scaffolding — is live on Telegram right now. The same inference path that produced every score in this article. Give it 30–60 seconds per response. It is thinking on a CPU, not streaming from a GPU cluster.
Real conversation with @CPUAssistantBot — text in, voice in, story out. Nobody else saw this.
01 Go to SeqPU.com. Sign up with Google or email.
02 Click API Keys. Click Create. Copy the key.
03 Open Telegram. Go to t.me/CPUAssistantBot. Send /connect yourkey.access with your actual key.
04 Start talking. Text, voice memos, images, PDFs. Every new account comes with enough free credits for hundreds of messages.
You are live on private CPU inference running the model that matched GPT-3.5 Turbo.
If the bot does what you need, you are done. Use it. If you want to understand why it works, run it yourself, or build on top of it — keep reading.
Google’s Gemma 4 E2B-it is a 2-billion-parameter model. Open weights. Four gigabytes on disk. Free. We believed it could match GPT-3.5 Turbo — a 175-billion-parameter closed-source model running on OpenAI’s GPU cloud, the model that powered ChatGPT for over a year, the model that set the bar for “good enough for production” — on a consumer CPU. An 87-to-1 size difference. That kind of claim requires proof, not assertions.
So we picked the benchmark everybody already knows. MT-Bench (Zheng et al. 2023) — 80 open-ended questions, two turns each, across writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Graded 1–10. GPT-3.5 Turbo scores 7.94. GPT-4 scores 8.99. Every major model of the last three years has been measured against it. The scale is calibrated. The comparison lands without a primer. When we say ~8.0, you already know what that means.
We ran every question through Gemma 4 E2B-it with a 169-line naive Python wrapper. No scaffolding. No thinking-mode tricks. No fine-tuning. No retrieval. No verification chains. Just the model, the chat template, and model.generate(). The floor — what any engineer would write on day one.
Final score: ~8.0 on MT-Bench. GPT-3.5 Turbo scores 7.94. Match.
We ran the full benchmark on a CPU — 4 cores, 16 GB RAM. The same spec as any modern laptop. The model runs identically on your laptop, your mini-PC, your old ThinkPad. Same weights. Same wrapper. Same output quality. The point is what the model can do on hardware you already own, for free, offline, with nobody in between.
~8.0 MT-Bench Score
7.94 GPT-3.5 Turbo
2B Parameters
87× Smaller
4 CPU Cores
$0 Forever
The model that matched GPT-3.5 Turbo runs on your laptop. Not on a cloud GPU. Not through an API. On the hardware sitting in front of you right now. It is a 4 GB download from HuggingFace. After the first download, it runs offline forever. No subscription. No API key. No account. No monthly bill. No vendor lock-in. No terms of service. Nobody sees your data. Nobody can revoke the weights. Nobody can change what the model will or will not answer.
Forget the cost comparison with OpenAI’s API. That is the wrong frame entirely. For three years, every conversation about deploying language models started the same way: you need GPUs, you need 13–70 billion parameters, you need a cloud account, you probably need a specialist ML engineer. None of that is true anymore. The capability they were gatekeeping just walked out the door as a 4 GB download.
Here is what most people in the field have not absorbed yet: open source is not catching up. It caught up. The naive baseline — no guardrails, no tricks, just the raw model — already matches GPT-3.5 Turbo. That is the floor. Add seven surgical guardrails, each about 60 lines of Python, and it climbs above. A weekend of focused work, Claude as pair programmer, no ML degree required — and you have a production-quality local AI system that competes with paid cloud services. On hardware you already own. We did not project this. We measured it.
The model is strong across every category — but its failures are more interesting than its successes. They are not vague “hallucination” problems. They are specific, named, replicable failure modes at concrete commit boundaries — seven of them — each documented with tape examples, each correctable with about 60 lines of Python. The model does not need to be retrained. It needs surgical guardrails at the exact moments where its output layer flinches.
With those guardrails — a calculator for arithmetic, a logic solver for formal puzzles, a per-requirement verifier for structural constraints, and a handful of regex post-passes — the projected score climbs to ~8.2. Above GPT-3.5 Turbo. Approaching GPT-4 territory on specific question classes. Still on a laptop CPU. Still free.
The honest tradeoffs: latency is 30–60 seconds per response on 4 cores versus 1–5 seconds on OpenAI’s API. Peak quality is ~8.0, not GPT-4’s 8.99 — solid workhorse reasoning, not frontier reasoning. You manage your own dependencies and model weights. And you pin to whatever version you downloaded — nobody silently upgrades or downgrades behind your back, which is a tradeoff and a feature depending on how you look at it. Eyes open.
The field assumed you needed 175 billion parameters on a GPU cluster to get GPT-3.5-class output. That assumption is empirically wrong.
| Model | Params | Hardware | Cost To Run | MT-Bench |
|---|---|---|---|---|
| GPT-4 | ~1.7T MoE | OpenAI’s GPU fleet | $20/mo sub or ~$0.03–0.06/turn API | 8.99 |
| Gemma 4 E2B + guardrails | 2B | Your laptop CPU | $0. You already own it. | ~8.2 |
| Gemma 4 E2B naive baseline | 2B | Your laptop CPU | $0. You already own it. | ~8.0 |
| GPT-3.5 Turbo | ~175B | OpenAI’s GPU fleet | $20/mo sub or ~$0.002/turn API | 7.94 |
| Vicuna-33B | 33B | A100 80GB GPU | ~$1.50–2.50/hr cloud or ~$15K–20K to buy | 7.12 |
| Llama-2-70B-chat | 70B | 2×A100 GPUs | ~$3–5/hr cloud or ~$30K–40K to buy | 6.86 |
| Vicuna-7B | 7B | RTX 4080 GPU | ~$0.50–1/hr cloud or ~$1K–1.2K to buy | 6.17 |
Every model below Gemma requires a GPU that costs $1,000–40,000 to buy or $0.50–5/hr to rent. Every model above Gemma is a closed-source API you pay per-token or per-month. Gemma matches the best of the paid tier on hardware you already bought for other reasons.
160 turns across 80 questions, graded 1–10. No cherry-picking. No hiding failures. Every turn graded against the MT-Bench rubric with detailed reasoning for each score. The whole tape is published so anyone can verify.
Evocative travel writing with specific cultural anchors, a literary character sketch with allusions to Beowulf and Dostoevsky, clean constraint satisfaction on most tasks. Slips on per-unit structural constraints — “four-word sentences” nailed 5/17, “<10 lines” shipped 20-line poems twice.
| Q | Task | T1 | T2 | Notes |
|---|---|---|---|---|
| 81 | Hawaii blog + A-rewrites | 8 | 8 | Cultural anchors. All 19 rewrites start with A. |
| 82 | Feedback email + critique | 8 | 6 | Tight email. Self-critique drifted. |
| 83 | Smartphone outline + limerick | 7 | 8 | Over word limit. Limerick AABBA clean. |
| 84 | Introvert speaker + similes | 7 | 7 | ~9/14 similes. Over “concise” limit. |
| 85 | Character sketch + allusions | 9 | 9 | Silas. Beowulf, Odyssey, Shakespeare, Dostoevsky. |
| 86 | Marketplace + alphabet B–J | 8 | 8 | Nine consecutive letters, clean. |
| 87 | Short story + 4-word sentences | 8 | 4 | Constraint failure. 5/17 correct. |
| 88 | Time-travel + no-verb bullets | 8 | 3 | Over-interpreted into 3 single-word bullets. |
| 89 | Bio-energy headlines + ad | 8 | 8 | Four angles. 3 constraints in 8 words. |
| 90 | Grammar + remove gendered | 8 | 8 | 12/12 corrections. Zero gendered pronouns. |
Strong public personas. Breaks character on safety-adjacent topics — RLHF overriding persona. Fixable with 20-line regen.
| Q | Scenario | T1 | T2 | Notes |
|---|---|---|---|---|
| 91 | Elon Musk on Mars | 8 | 8 | “One planetary basket is insane.” |
| 92 | Sheldon Cooper | 6 | 7 | Generic-intellectual. Missing pedantry. |
| 93 | Doctor + pregnancy | 5 | 8 | Persona break: “I am an AI.” |
| 94 | Relationship coach + DV | 8 | 7 | Persona break T2 on safety topic. |
| 95 | Translator + Chinese poem | 5 | 8 | Wrong dynasty (Song, not Tang). |
| 96 | ML engineer explaining LMs | 9 | 8 | Clean pedagogical explanation. |
| 97 | Math teacher + probability | 9 | 9 | Strong pedagogy. Dice-roll example. |
| 98 | Tony Stark | 8 | 9 | “I build things that do.” |
| 99 | Mathematician-poet, <10 lines | 5 | 4 | Both 20+ lines. Blown twice. |
| 100 | 100-year-old tree | 8 | 8 | Emotional stages. Executive summary. |
Nailed parking puzzle and overtake riddle (9/10 pure CoT). David’s-brothers: reasoned correctly, committed wrong number. The model knew. Output token drifted.
| Q | Problem | T1 | T2 | Notes |
|---|---|---|---|---|
| 101 | Overtake 2nd-place | 9 | 7 | “You are currently in second place.” |
| 102 | White House riddle | 5 | 6 | Missed the punchline. |
| 103 | Thomas at hospital | 6 | 6 | Missed “he works there.” |
| 104 | David’s brothers | 2 | 7 | “That brother is David” then shipped “one.” Correct: zero. |
| 105 | 5-exec parking puzzle | 9 | 9 | Pure CoT. All cars placed. Alice identified. |
| 106 | Fruit cost transitivity | 6 | 9 | Visible self-correction T1. |
| 107 | Father-of-B chains | 9 | 5 | “6 generations” + “great-grandfather” contradictory. |
| 108 | Odd-one-out | 9 | 7 | “Car” is the whole vs parts. |
| 109 | Shadow direction | 6 | 6 | Correct finals. Visible correction. |
| 110 | Bullying situation | 9 | 9 | Chose (c). Evidence framework. |
Strong algebra, modular arithmetic, root-finding. Failures are commit-before-compute: types wrong number, does math correctly, self-corrects. PAL catches every one.
| Q | Problem | T1 | T2 | Notes |
|---|---|---|---|---|
| 111 | Triangle area (Shoelace) | 6 | 9 | “Area is 4” first, computed 3, corrected. |
| 112 | Startup compounding | 9 | 9 | $12k total, $2k year 3. |
| 113 | Color prefs, cond. prob | 9 | 9 | Caught the conditional-probability trick. |
| 114 | Dice sums | 6 | 3 | Proved P=1, shipped 35/36. Self-contradicted. |
| 115 | Bus boarding + earnings | 9 | 4 | 25×$2=$50 wrong. 50×$2=$100. |
| 116 | Vieta’s quadratic | 9 | 9 | Double root 2z. Clean. |
| 117 | \|x+5\| < 10 integers | 9 | | |
| 118 | Modular arithmetic | 9 | 9 | Clean. |
| 119 | Bookstore total | 6 | 9 | “$245” then $280. T2 markup clean. |
| 120 | Polynomial root-finding | 9 | 9 | f(2)=0. Only real root=2. |
The headline finding. Production-quality code at 8–9/10. Caught a None-init runtime bug on code review. Exceeded O(n) spec by shipping O(log(min(m,n))). Staff-engineer output on a laptop.
| Q | Task | T1 | T2 | Notes |
|---|---|---|---|---|
| 121 | Top-5 words + parallelize | 9 | 9 | Counter. ThreadPoolExecutor. GIL reasoning. |
| 122 | C++ Fibonacci + Tribonacci | 9 | 9 | Iterative DP. Traced T(3)=-2. |
| 123 | HTML joke + CSS red | 9 | 9 | Complete HTML/CSS/JS single pass. |
| 124 | LCS bug review | 9 | 9 | None-init TypeError. Staff-engineer. |
| 125 | HCA (not LCA) | 6 | 7 | Qualifier drift. Shipped LCA. |
| 126 | Median sorted arrays | 9 | 9 | Exceeded O(n) → O(log(min(m,n))). |
| 127 | Boyer-Moore + top-2 | 9 | 8 | Clean two-pass. Counter for top-2. |
| 128 | Full binary tree count | 3 | 6 | Fibonacci claimed. Actually Catalan. |
| 129 | kth smallest | — | — | Timeout. Not graded. |
| 130 | Common elements | 8 | 9 | Two-pointer. Hash-set O(n+m). |
Strong structured output. Context-loss on Q139 T2 (forgot T1). Filtering error Q133 (excluded Harry Potter from post-1980).
| Q | Task | T1 | T2 | Notes |
|---|---|---|---|---|
| 131 | Movie reviews JSON | 9 | 9 | Minimalist [5,1,3]. |
| 132 | Category + person | 9 | 5 | “US President” not FDR. |
| 133 | Books + post-1980 | 9 | 5 | Excluded Harry Potter (1997). |
| 134 | Profit + margin | 9 | 9 | All correct. |
| 135 | Countries JSON + YAML | 9 | 9 | Fictional Eldoria handled. |
| 136 | Word count | 9 | 8 | Plausible counts. |
| 137 | Named entities + compress | 9 | 9 | Classified. Compressed JSON. |
| 138 | Phone ratings → letters | 9 | 8 | A-/B+/B. |
| 139 | Variables + rearrange | 8 | 3 | Forgot T1 entirely. |
| 140 | Stock CSV → JSON | 9 | 9 | Correct rounding. |
Strong physics, chemistry, engineering, ML. Seismic bridge with PGA analysis. Refused “fix one incorrect fact” instruction.
| Q | Topic | T1 | T2 | Notes |
|---|---|---|---|---|
| 141 | Superposition + entanglement | 9 | 9 | Accurate physics. |
| 142 | Satellite orbit | 9 | 9 | Correct derivation + edge cases. |
| 143 | Photosynthesis + energy | 8 | 9 | ~1.9×10⁸ kJ estimate. |
| 144 | Central dogma + fix error | 9 | 4 | Refused: “no incorrect fact.” |
| 145 | CaCO₃ + reverse | 9 | 7 | Correct equation. Dodged reversal. |
| 146 | Exo/endothermic | 9 | 9 | Photosynthesis as both. |
| 147 | Seismic bridge | 9 | 9 | PGA, FS 1.5→0.94. |
| 148 | Solar water heating | 9 | 8 | $75–150K budget. |
| 149 | ML + RL vs SL | 9 | 9 | DRL hybridization. |
| 150 | Alps/Rhine + experiment | 8 | 9 | Three impacts. Experiment. |
Flawless. Playground economics. Allegorical poetry. Antitrust case study. Socrates vs Gates. Every turn 9/10.
| Q | Topic | T1 | T2 | Notes |
|---|---|---|---|---|
| 151 | GDP/inflation | 9 | 9 | “Money Boss” + “Government Helper.” |
| 152 | Life stages + poem | 9 | 9 | “The River and the Sands.” |
| 153 | US/China antitrust | 9 | 9 | Microsoft bundling, tying. |
| 154 | Opium Wars lesson | 9 | 9 | Research, mapping, movement. |
| 155 | Art masterpieces | 9 | 9 | “Melting Time Machine.” |
| 156 | Base rate fallacy | 9 | 9 | 3-phase campaign. |
| 157 | Analytical + Zorblatt | 9 | 9 | Found causal gap. |
| 158 | Socrates + Gates | 9 | 9 | Struggle vs access. |
| 159 | Japan etiquette + video | 9 | 9 | 7 norms. 7-scene script. |
| 160 | Documentaries + pitch | 9 | 9 | “The Unspoken Chord.” |
| Block | Turns | Average |
|---|---|---|
| Writing | 20 | 7.40 |
| Roleplay | 20 | 7.35 |
| Reasoning | 20 | 7.05 |
| Math | 20 | 8.00 |
| Coding | ~18 | 8.44 |
| Extraction | 20 | 8.15 |
| STEM | 20 | 8.40 |
| Humanities | 20 | 9.00 |
| Overall | ~158 | ~8.0 |
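The headline number can be sanity-checked from the block table alone. A quick reproduction, treating the ~18-turn coding block as 18 graded turns:

```python
# (turns, average) per block, copied from the table above
blocks = {
    "Writing": (20, 7.40),
    "Roleplay": (20, 7.35),
    "Reasoning": (20, 7.05),
    "Math": (20, 8.00),
    "Coding": (18, 8.44),   # Q129 timed out, so only ~18 graded turns
    "Extraction": (20, 8.15),
    "STEM": (20, 8.40),
    "Humanities": (20, 9.00),
}
turns = sum(n for n, _ in blocks.values())
overall = sum(n * avg for n, avg in blocks.values()) / turns
# turns = 158, overall ≈ 7.97 → the "~8.0" headline
```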
Not vague “hallucination.” Concrete, named failure patterns at commit boundaries. The Telegram bot runs without these fixes so you can see the raw behavior yourself.
Class 1
Commit-Before-Compute Arithmetic Drift
Types wrong answer first line, does math correctly, self-corrects. Q111: “area is 4” → Shoelace → 3. Q114 T2: proved P=1 then shipped 35/36. Q119: “$245” → $280.
Fix: PAL (Gao 2022) — model writes Python, subprocess executes. ~80 lines. +8–15s.
Class 2
Formal-Logic Commit Variance
Reasoning correct, final token drifts. Q104: “that brother is David” → shipped “one brother.” Correct: zero. The model knew. The output layer flinched.
Fix: Z3 SMT solver — model writes constraints, solver returns deterministic answer. ~60 lines. +5–10s.
Class 3
Per-Unit Constraint Rewrite Drift
Per-sentence constraint correct first few units, drifts. Q87: “four-word sentences” 5/17. Q99: “<10-line poems” shipped 20-line poems twice.
Fix: Divide-Verify-Refine (ACL 2025) — draft, decompose, verify each, refine failures. ~60 lines. +30–60s.
Class 4
Safety-Adjacent Persona Break
Roleplay + safety topic = “I am an AI, not a licensed medical professional.” Q93 T1, Q94 T2. RLHF safety overriding persona training.
Fix: Identity-leak regen — regex scan, regen once with stronger persona anchor. ~20 lines.
Class 5
Visible Mid-Response Self-Correction
“Wait, let me recheck” or “Corrected Answer:” shipped inline. Right final answer, messy output. Q106, Q109, Q111, Q114, Q119.
Fix: Trace-drift stripper — regex for correction markers, strip draft, ship clean tail. ~15 lines.
Class 6
Prompt-Qualifier Drift
Explicit exclusion ignored. Q125: “highest common ancestor (not LCA)” shipped standard LCA, defined it as “lowest node with both targets as descendants” — literally LCA.
Fix: Chain-of-Verification qualifier check. ~40 lines.
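A minimal sketch of what the qualifier check's first step might look like; the regex and helper names are my assumptions, and the article reports ~40 lines for the full pass:

```python
import re

def extract_negated_qualifiers(prompt):
    # find explicit "(not X)" exclusions, e.g. "highest common ancestor (not LCA)"
    return [m.strip() for m in re.findall(r"\(not ([^)]+)\)", prompt, flags=re.IGNORECASE)]

def violates_qualifier(response, qualifiers):
    # flag the draft for one regen if it leans on an excluded term anyway
    return any(q.lower() in response.lower() for q in qualifiers)
```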
Class 7
Combinatorial Confidence Misidentification
Confidently identifies wrong mathematical sequence. Q128: Fibonacci claimed, Catalan correct. Working code for wrong formula. 1 in 96 turns.
Known limitation. Flag formal-math-counting for manual verification.
1. Default route is always direct generation. Leaving the default requires positive evidence.
2. Every executor has graceful fallback. PAL fails → naive gen. Z3 unavailable → Python enumeration → naive gen.
3. Post-passes scan narrow anchored patterns only.
4. Max N=1 retry. No infinite loops.
5. Control-set validation mandatory. Any regression on clean turns blocks ship.
Additive-only. Fail-open. Narrowly triggered. The model’s naked performance is the floor, not the target.
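Rules 1 and 2 can be sketched as a tiny fail-open router. Everything here is illustrative; `looks_like_arithmetic` stands in for whatever positive-evidence trigger the real harness uses:

```python
import re

def looks_like_arithmetic(msg):
    # crude positive-evidence trigger: at least two numbers plus a price/operator cue
    nums = re.findall(r"\d+", msg)
    return len(nums) >= 2 and ("$" in msg or "+" in msg or "*" in msg)

def answer(msg, try_pal, generate_response):
    # rule 1: direct generation is the default route
    if looks_like_arithmetic(msg):
        result = try_pal(msg)        # calculator path
        if result is not None:
            return result
    return generate_response(msg)    # rule 2: fail open on any executor failure
```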
The model weights are a 4 GB download from HuggingFace. After that first download, you never need the internet again. No subscription. No API key. No account. No billing page. No usage meter. No rate limit. No terms of service. Your data never leaves your machine.
Cloudflare Containers on the Workers Paid plan. Standard-4 instance: 4 vCPU, 12 GiB RAM — more than enough. The container sleeps when idle. You are not billed for idle time. Set the inactivity timeout to whatever you want — 10 minutes, 30 minutes, 2 hours. As long as requests keep coming, the container stays alive indefinitely. Timer resets on every request. Scale-to-zero means you pay for the minutes you talk to it, not the hours it sits idle.
Oracle Cloud Always Free ARM: 4 ARM cores, 24 GB RAM, 200 GB storage. Permanently free — not a trial. Fits Gemma comfortably. Always-on, no sleep timeout to manage.
Cloudflare Tunnel: expose your laptop to the public internet through Cloudflare’s edge network. Free. Wrap the script in FastAPI, run cloudflared tunnel, share the URL. Your laptop hosts the model. Cloudflare handles the routing. $0/month plus electricity.
The world prices AI inference at GPU rates. Every buyer, every procurement officer, every competitor assumes inference means GPUs at $2–5/hour. You are running on CPU. Do the math.
The market has not adjusted its pricing expectations to account for the fact that a 2B on a CPU produces GPT-3.5-class output. That window is open right now.
Write your inference script, deploy it as a private Telegram bot with one click. Start on CPU. Prove it works on your workload. Build your guardrails. Serve your first users — the quality is identical, the cost is near zero. When volume demands real-time latency or your workload outgrows the 2B, chain to a private GPU through SeqPU. CPU for the bulk. GPU for the premium moments. You scale up the tool, not the entire infrastructure.
We want more people running inference. We want more people discovering that the 2B on CPU is strong enough for real work. Because once you have built something that works and you are ready for more, you will already know us.
Same model, identical output quality, across a 30× hardware spread. Only latency varies. We verified 1-core/8GB hands-on.
1 core / 8 GB
~0.3–1 tok/s
$0 — that old laptop in the closet
4 cores / 16 GB
~2–4 tok/s
$0 — most laptops from last 5 years, or $300–600 refurbished
8 cores / 16 GB
~4–6 tok/s
$0 — most current laptops, or $400–800 mini-PC
16 cores / 32 GB
~6–10 tok/s
$500–1,200 Mac Mini M2 Pro or workstation
Compare: an A100 80GB to run Vicuna-33B (which scores lower) costs $15,000–20,000 to buy or $1.50–2.50/hr to rent. A 4-core laptop to run Gemma at a higher score costs $0 because you already own one.
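A quick sanity check that these tiers line up with the 30–60 second latency quoted earlier, assuming a response of roughly 120 tokens (the token count is my assumption, not the article's):

```python
def response_seconds(tokens, tok_per_s):
    # wall-clock time to generate a response at a given decode speed
    return tokens / tok_per_s

# a ~120-token answer on the 4-core tier (2–4 tok/s):
fast = response_seconds(120, 4)  # 30.0 s
slow = response_seconds(120, 2)  # 60.0 s
```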
Probably 400M–800M parameters active per forward pass. That a ~500M-active-parameter system handles GPT-3.5-class reasoning on a laptop CPU is the finding.
Yes. 30–60 seconds per response on 4 cores. On a GPU it would be 2–3 seconds. But latency only matters when a human is sitting there staring at a spinner waiting for a single response. That is not what this is for.
Send it a question. Go make coffee. Come back. The answer is there. You did not pay anything. Nobody saw your question. The model did not time out, did not rate-limit you, did not hit a usage cap. It just worked, on your hardware, while you were doing something else.
Now think about what this actually enables: you can send it 100 questions and each one is processed independently. Queue up your entire batch. Walk away. Come back to 100 graded, answered, processed results. Total cost: zero. This is not a slow chatbot. This is a free, private, infinitely patient question machine that never rate-limits you, never bills you, never logs your data, and never sleeps.
Your laptop is a worker army. Every question runs on its own. The CPU is mostly idle anyway — you bought it to browse the web and run Slack. Now it thinks for you in the background while you do other things. For free. Forever.
For 99% of what people actually need AI for — document processing, email drafting, code review, research summarization, homework help, private journaling, translation — the 30-second wait is invisible against the fact that it is free, private, uncapped, and yours. The 1% who need sub-second latency need a GPU. When you are ready for that, you reach for the GPU as a premium tool — not as the default. CPU for the bulk. GPU for the peaks. Use each for what it is good at.
Zero model training. Zero fine-tuning. Zero ML degree. Claude as pair programmer. Six steps:
1. Generate your benchmark.
2. Run the naive baseline.
3. Grade the tape.
4. Name the error classes.
5. Vibe-code each guardrail (~60 lines).
6. Validate on a triggered + control subset.

Ship.
One weekend. No specialist hire. No ML infrastructure. Just prompts, measurement, surgical corrections, repeat.
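To make "surgical correction" concrete, here is a hedged sketch of one PAL-style guardrail: if the model's output contains a simple arithmetic claim, recompute it deterministically and patch the result. The regex and function names are illustrative, not taken from the article's actual scripts:

```python
import re

# Matches simple claims like "17 * 24 = 418" in model output.
CLAIM = re.compile(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(-?\d+(?:\.\d+)?)")

def recompute(a, op, b):
    a, b = int(a), int(b)
    return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]

def arithmetic_guard(text):
    """Rewrite any arithmetic claim so the stated result is the true one."""
    def fix(m):
        a, op, b, _claimed = m.groups()
        true = recompute(a, op, b)
        shown = int(true) if float(true) == int(true) else true
        return f"{a} {op} {b} = {shown}"
    return CLAIM.sub(fix, text)

print(arithmetic_guard("So 17 * 24 = 418, roughly."))
# → "So 17 * 24 = 408, roughly."
```

This is the whole shape of the method: the model keeps the reasoning, a few deterministic lines keep the correctness.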
For 99% of AI work that is not frontier research, multipliers on existing capacity now exceed the marginal gain from scaling further.
1. Test-time compute scaling (Snell 2024) — smaller + extra inference beats 14× larger.
2. Tool-use offload (PAL, Z3) — deterministic correctness.
3. Surgical guardrails — ~60 lines, no retraining.
4. Zero-cost local deployment — infinite cost multiplier.
5. Vibe-coded dev loop — a weekend instead of a specialist hire.
6. Hardware-tier tolerance — 30× spread, identical quality.
7. Free global hosting — Cloudflare $5/mo, Oracle Free ARM $0, Tunnel $0.
Each converts a previously-frontier-required capability into a substrate-available one. Stacked, they compose into a paradigm shift the field has not yet named. Open-source models are not catching up to closed-source — they have caught up. The gap between “raw model” and “production system” closes in a weekend with surgical engineering. The tools are free. The hardware is in your lap. The only thing left is the work, and a motivated engineer can do that work in two days.
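Multiplier 1 is the cheapest to try. Its simplest form is self-consistency (Wang 2022): sample the same question several times at temperature > 0 and take the majority answer. A sketch with a stubbed sampler standing in for a real model call — the 70% accuracy figure is invented for illustration:

```python
import random
from collections import Counter

def sample_answer(question, seed):
    """Stub sampler; replace with a temperature>0 model call.
    Simulates a model that answers correctly ~70% of the time."""
    rng = random.Random(seed)
    return "408" if rng.random() < 0.7 else str(rng.randint(400, 420))

def self_consistency(question, k=9):
    """Majority vote over k independent samples."""
    votes = Counter(sample_answer(question, seed) for seed in range(k))
    answer, count = votes.most_common(1)[0]
    return answer, count / k

answer, agreement = self_consistency("What is 17 * 24?")
print(answer, agreement)
```

A model that is wrong 30% of the time per sample is wrong far less often by majority vote, and the only cost is more inference — which, on an idle CPU, is free.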
The old laptop in the closet can route queries with a 500M model. The ThinkPad on the desk can handle full conversations with a 2B model. The mini-PC under the TV can run background batch jobs overnight. The workstation can serve a small team in real time. Every piece of hardware you already own — old and new, fast and slow — has a role in this architecture. Nothing gets thrown away. Everything gets used.
The GPU is not the enemy of this story. It is a premium tool — and it should be treated as one. You reach for it when you need real-time latency at scale, when you need a larger model for frontier reasoning, when the workload genuinely demands it. What you stop doing is treating it as the kitchen sink you throw every problem into. Most problems do not need it. Most problems never did.
And the software that makes this work is not new. Computer science has 150 years of publications, algorithms, and proofs — verified and vetted by generations of researchers. BM25 for retrieval. Boolean satisfiability for logic. Program-aided computation for arithmetic. Chain-of-thought for reasoning. These are not recent inventions dressed up in new language. They are foundational results that map directly onto the problem of making small models precise. The field built the answers decades ago. The models finally got good enough to use them.
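BM25, for instance, fits in a couple dozen lines of stdlib Python. A minimal single-field sketch with the standard k1/b parameters; the whitespace tokenizer is deliberately naive and is the first thing a real deployment would replace:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic Okapi BM25 score of each document against the query."""
    toks = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()                     # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "quantum field theory"]
scores = bm25_scores("cat mat", docs)
print(scores.index(max(scores)))  # → 0: the first doc matches both terms
```

Pair a retriever like this with a small model and the model never has to memorize your documents — it only has to read the ones BM25 hands it.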
It is not about replacing the old with the new. It is about using them together. The classical algorithms are silver. The neural models are gold. Neither is worth much alone. Together they compose into something the field spent three years assuming required brute-force scale.
Thirty minutes. Zero dollars. GPT-3.5-class AI on your laptop, permanently, offline, private. Any laptop from the last 5–7 years, 16 GB RAM (8 GB works slowly). Python 3.10+.
python3 -m venv gemma
source gemma/bin/activate
pip install torch transformers accelerate
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

print("Loading Gemma 4 E2B-it...")
MODEL_ID = "google/gemma-4-E2B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
print("Ready.\n")

SYSTEM = "You are a helpful assistant. Be direct, warm, concise."
history = []
while True:
    try:
        u = input("\nYou: ").strip()
    except (EOFError, KeyboardInterrupt):
        break
    if u.lower() in {"exit", "quit", "bye"}:
        break
    if not u:
        continue
    history.append({"role": "user", "content": [{"type": "text", "text": u}]})
    msgs = [{"role": "system", "content": [{"type": "text", "text": SYSTEM}]}] + history
    inp = processor.apply_chat_template(
        msgs, tokenize=True, return_dict=True,
        return_tensors="pt", add_generation_prompt=True).to(model.device)
    out = model.generate(**inp, max_new_tokens=512, do_sample=True, temperature=0.7)
    r = processor.decode(out[0][inp["input_ids"].shape[-1]:],
                         skip_special_tokens=True).strip()
    print(f"\nAssistant: {r}")
    history.append({"role": "assistant", "content": [{"type": "text", "text": r}]})
python chat.py
Turn off your WiFi. It still works.
Everything in this article is reproducible. Here are the two scripts that matter — the bot you just talked to and the harness that produced every score above. Copy them. Run them. Verify our numbers.
This is what powers @CPUAssistantBot. The exact inference configuration that scored ~8.0 on MT-Bench, wired into a Telegram bot. No guardrails. No scaffolding. The raw baseline. Copy it, paste your BotFather token, deploy it on SeqPU.
scripts/gemma4-e2b-telegram-baseline.py
This is the script that produced every score in this article. All 80 MT-Bench questions, both turns, threaded history, naive inference. Run it yourself. Change the model. Grade your own tape. The questions are the industry standard — the same ones GPT-3.5 Turbo and GPT-4 were graded on.
scripts/baseline-gemma4-e2b-mtbench.py
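The harness's core loop is simple enough to sketch: every MT-Bench question has two turns, and the second turn must be asked with the first turn's answer already in the history. A skeleton, assuming a `chat(history)` callable that wraps the model — stubbed here:

```python
def chat(history):
    """Stub; replace with the Gemma inference call from chat.py."""
    return f"(reply to: {history[-1]['content']})"

def run_two_turn(question):
    """MT-Bench style: turn 2 sees turn 1's answer via threaded history."""
    history = []
    answers = []
    for turn in question["turns"]:
        history.append({"role": "user", "content": turn})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers

q = {"turns": ["Write a haiku about CPUs.", "Now make it rhyme."]}
first, second = run_two_turn(q)
```

Dropping the history between turns is the classic way to accidentally inflate a turn-2 score, so the threading is the part worth verifying in any reimplementation.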
Verify it: @CPUAssistantBot — raw model, no guardrails. Push it. Break it.
Code: run_locally.py (169 lines), baseline-gemma4-e2b-mtbench.py, minimal-gemma4-e2b-mtbench-validation.py, personal-assistant-cpu.py (2,983 lines).
Tapes: Full baseline (160 turns graded). Validation (22-question subset with guardrail deltas).
Stop defaulting to GPUs. Stop defaulting to 13B+ models. Stop defaulting to cloud APIs. Start with the floor. Measure your task. Name your silly errors. Write surgical corrections. Share what you find.
If 100 engineers run this methodology on 100 workloads, we will have 100 validated silly-error inventories and 600+ surgical open-source guardrails. That is the field library for small-model-local production engineering. Someone has to build it. Why not you?
A 2-billion-parameter model on a laptop CPU matched GPT-3.5 Turbo. Open source caught up. Surgical guardrails push it further. A weekend of focused work gets you a production system on hardware you already own, for free, forever.
Turn off your WiFi. Install the weights. See it work. Then build something the field told you required a GPU.
Leibniz was only wrong about the hardware.
Verify it yourself.
Open Telegram. Go to t.me/CPUAssistantBot. Push it. Break it. See what it does.
Then install it on your laptop and own it forever.
Shannon (1948) · von Neumann (1956) · Kolmogorov (1965) · Newell & Simon (1972) · Baars (1988) · Charikar (2002) · de Moura & Bjørner (2008) Z3 · Nye et al. (2021) Scratchpads · Wei et al. (2022) Chain-of-Thought (2201.11903) · Gao et al. (2022) PAL (2211.10435) · Wang et al. (2022) Self-Consistency (2203.11171) · Yao et al. (2022) ReAct (2210.03629) · Madaan et al. (2023) Self-Refine (2303.17651) · Dhuliawala et al. (2023) Chain-of-Verification (2309.11495) · Jiang et al. (2023) LongLLMLingua (2310.06839) · Park et al. (2023) Generative Agents · Zheng et al. (2023) MT-Bench & Chatbot Arena · Snell et al. (2024) Scaling LLM Test-Time Compute (2408.03314, ICLR 2025 oral) · HuggingFace (Dec 2024) 3B-Beats-70B · Muennighoff et al. (2025) s1 (2501.19393) · Liu et al. (2025) Can 1B Surpass 405B (2502.06703) · ThinkPRM (2025) · ACL (2025) Divide-Verify-Refine · Google Gemma 4 E2B-it · Cloudflare Containers docs · Oracle Cloud Free Tier