Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent which tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs (123 FP16 TFLOPS / INT8 TOPS, 864 GB/s memory bandwidth), but it has had notoriously bad support both from AMD (ROCm) and from llama.cpp.
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, so I started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
ds4.c is a small native inference engine for DeepSeek V4 Flash. It is
intentionally narrow: not a generic GGUF runner, not a wrapper around another
runtime, and not a framework. The main path is a DeepSeek V4 Flash-specific
Metal graph executor with DS4-specific loading, prompt rendering, KV state, and
server API glue.
This project would not exist without llama.cpp and GGML; make sure to read the acknowledgements section. A big thank you to Georgi Gerganov and all the other contributors.
Now, back to this project. Why do we believe DeepSeek V4 Flash is a pretty special model, deserving a standalone engine? Because after comparing it with powerful smaller dense models, we can report that:
That said, a few important things about this project: it owes a lot to
llama.cpp and GGML, but is largely written by hand. ds4.c does not link against GGML, but it exists thanks to the path opened by the
llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won
engineering knowledge developed there.
We are thankful and indebted to llama.cpp
and its contributors. Their implementation, kernels, tests, and design choices were
an essential reference while building this DeepSeek V4 Flash-specific inference path.
Some source-level pieces are retained or adapted here under the MIT license: GGUF
quant layouts and tables, CPU quant/dot logic, and certain Metal kernels. For this
reason, and because we are genuinely grateful, we keep the GGML authors' copyright
notice in our LICENSE file.
This implementation only works with the DeepSeek V4 Flash GGUFs published for
this project. It is not a general GGUF loader, and arbitrary DeepSeek/GGUF files
will not have the tensor layout, quantization mix, metadata, or optional MTP
state expected by the engine. The 2 bit quantizations provided here are not
a joke: they behave well, work under coding agents, and call tools reliably.
The 2 bit quants use a very asymmetrical quantization: only the routed MoE
experts are quantized, up/gate at IQ2_XXS, down at Q2_K. The routed experts
account for the majority of the model's weights; the other components (shared
experts, projections, routing) are left untouched to guarantee quality.
Download one main model:
./download_model.sh q2 # 128 GB RAM machines
./download_model.sh q4 # >= 256 GB RAM machines
The script downloads from https://huggingface.co/antirez/deepseek-v4-gguf,
stores files under ./gguf/, resumes partial downloads with curl -C -, and
updates ./ds4flash.gguf to point at the selected q2/q4 model. Authentication
is optional for public downloads, but --token TOKEN, HF_TOKEN, or the local
Hugging Face token cache are used when present.
./download_model.sh mtp fetches the optional speculative decoding support
GGUF. It can be used with both q2 and q4, but must be enabled explicitly with
--mtp. The current MTP/speculative decoding path is still experimental: it is
correctness-gated and currently provides at most a slight speedup, not a
meaningful generation-speed win.
Then build:
make
./ds4flash.gguf is the default model path used by both binaries. Pass -m to
select another supported GGUF from ./gguf/. Run ./ds4 --help and
./ds4-server --help for the full flag list.
These are single-run Metal CLI numbers with --ctx 32768, --nothink, greedy
decoding, and -n 256. The short prompt is a normal small Italian story
prompt. The long prompts exercise chunked prefill plus long-context decode.
Q4 requires the larger-memory machine class, so M3 Max Q4 numbers are N/A.
| Machine | Quant | Prompt | Prefill | Generation |
|---|---|---|---|---|
| MacBook Pro M3 Max, 128 GB | q2 | short | 58.52 t/s | 26.68 t/s |
| MacBook Pro M3 Max, 128 GB | q2 | 11709 tokens | 250.11 t/s | 21.47 t/s |
| MacBook Pro M3 Max, 128 GB | q4 | short | N/A | N/A |
| MacBook Pro M3 Max, 128 GB | q4 | long | N/A | N/A |
| Mac Studio M3 Ultra, 512 GB | q2 | short | 84.43 t/s | 36.86 t/s |
| Mac Studio M3 Ultra, 512 GB | q2 | 11709 tokens | 468.03 t/s | 27.39 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | short | 78.95 t/s | 35.50 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | 12018 tokens | 448.82 t/s | 26.62 t/s |
One-shot prompt:
./ds4 -p "Explain Redis streams in one paragraph."
Running without -p starts the interactive prompt:
./ds4
ds4>
The interactive CLI is a real multi-turn DS4 chat. It keeps the rendered chat
transcript and the live Metal KV checkpoint, so each turn extends the previous
conversation. Useful commands are /help, /think, /think-max, /nothink,
/ctx N, /read FILE, and /quit. Ctrl+C interrupts the current generation
and returns to ds4>.
The CLI defaults to thinking mode. Use /nothink or --nothink for direct
answers. --mtp MTP.gguf --mtp-draft 2 enables the optional MTP speculative
path; it is useful only for greedy decoding, currently uses a confidence gate
(--mtp-margin) to avoid slow partial accepts, and should be treated as an
experimental slight-speedup path.
Start a local OpenAI/Anthropic-compatible server:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
The server is Metal-only. It keeps one mutable graph/KV checkpoint in memory, so stateless clients that resend a longer version of the same prompt can reuse the shared prefix instead of pre-filling from token zero.
Request parsing and sockets run in client threads, but inference itself is serialized through one Metal worker. The current server does not batch multiple independent requests together; concurrent requests wait their turn on the single live graph/session.
Supported endpoints:
GET /v1/models
GET /v1/models/deepseek-v4-flash
POST /v1/chat/completions
POST /v1/completions
POST /v1/messages

/v1/chat/completions accepts the usual OpenAI-style messages,
max_tokens/max_completion_tokens, temperature, top_p, top_k, min_p,
seed, stream, stream_options.include_usage, tools, and tool_choice.
Tool schemas are rendered into DeepSeek's DSML tool format, and generated DSML
tool calls are mapped back to OpenAI tool calls.
/v1/messages is the Anthropic-compatible endpoint used by Claude Code style
clients. It accepts system, messages, tools, tool_choice, max_tokens,
temperature, top_p, top_k, stream, stop_sequences, and thinking
controls. Tool uses are returned as Anthropic tool_use blocks.
Both APIs support SSE streaming. In thinking mode, reasoning is streamed in the native API shape instead of being mixed into final text.
Minimal OpenAI example:
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"deepseek-v4-flash",
"messages":[{"role":"user","content":"List three Redis design principles."}],
"stream":true
}'
ds4-server can be used by local coding agents that speak OpenAI-compatible
chat completions. Start the server first, and set the client context limit no
higher than the --ctx value you started the server with:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
You can use a larger context and a larger cache if you wish. A full context of 1M tokens will use roughly 26GB of memory (the compressed indexer alone is around 22GB), so configure a context that makes sense for your system. With 128GB of RAM you would run the 2-bit quants, which are already 81GB; an extra 26GB is likely too much, so a context window of 100~300k tokens is wiser.
The 384000 output limit below avoids token caps since the model is able
to generate very long replies otherwise (up to 384k tokens). The server
still stops when the configured context window is full.
For opencode, add a provider and agent entry to
~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ds4": {
"name": "ds4.c (local)",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://127.0.0.1:8000/v1",
"apiKey": "dsv4-local"
},
"models": {
"deepseek-v4-flash": {
"name": "DeepSeek V4 Flash (ds4.c local)",
"limit": {
"context": 100000,
"output": 384000
}
}
}
}
},
"agent": {
"ds4": {
"description": "DeepSeek V4 Flash served by local ds4-server",
"model": "ds4/deepseek-v4-flash",
"temperature": 0
}
}
}
For Pi, add a provider to ~/.pi/agent/models.json:
{
"providers": {
"ds4": {
"name": "ds4.c local",
"baseUrl": "http://127.0.0.1:8000/v1",
"api": "openai-completions",
"apiKey": "dsv4-local",
"compat": {
"supportsStore": false,
"supportsDeveloperRole": false,
"supportsReasoningEffort": true,
"supportsUsageInStreaming": true,
"maxTokensField": "max_tokens",
"supportsStrictMode": false,
"thinkingFormat": "deepseek",
"requiresReasoningContentOnAssistantMessages": true
},
"models": [
{
"id": "deepseek-v4-flash",
"name": "DeepSeek V4 Flash (ds4.c local)",
"reasoning": true,
"thinkingLevelMap": {
"off": null,
"minimal": "low",
"low": "low",
"medium": "medium",
"high": "high",
"xhigh": "xhigh"
},
"input": ["text"],
"contextWindow": 100000,
"maxTokens": 384000,
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
}
}
]
}
}
}
Optionally make it the default Pi model in ~/.pi/agent/settings.json:
{
"defaultProvider": "ds4",
"defaultModel": "deepseek-v4-flash"
}
For Claude Code, use the Anthropic-compatible endpoint. A wrapper like this
matches the local ~/bin/claude-ds4 setup:
#!/bin/sh
unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="${DS4_ANTHROPIC_BASE_URL:-http://127.0.0.1:8000}"
export ANTHROPIC_AUTH_TOKEN="${DS4_API_KEY:-dsv4-local}"
export ANTHROPIC_MODEL="deepseek-v4-flash"
export ANTHROPIC_CUSTOM_MODEL_OPTION="deepseek-v4-flash"
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="DeepSeek V4 Flash local ds4"
export ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION="ds4.c local GGUF"
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_SUBAGENT_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_DISABLE_NONSTREAMING_FALLBACK=1
export CLAUDE_STREAM_IDLE_TIMEOUT_MS=600000
exec "$HOME/.local/bin/claude" "$@"
Claude Code may send a large initial prompt, often around 25k tokens, before it
starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive
prefill, the disk KV cache lets later continuations or restarted sessions reuse
the saved prefix instead of processing the whole prompt again.
DeepSeek V4 Flash has distinct non-thinking, thinking, and Think Max modes.
The server defaults to thinking mode. reasoning_effort=max requests Think
Max, but it is only applied when the context size is large enough for the model
card recommendation; smaller contexts fall back to normal thinking. OpenAI
reasoning_effort=xhigh still maps to normal thinking, not Think Max.
For direct replies, use thinking: {"type":"disabled"}, think:false, or a
non-thinking model alias such as deepseek-chat.
Chat/completion APIs are stateless: agent clients usually resend the whole
conversation every request. ds4-server handles this by comparing the rendered
token stream with cached token prefixes. The live in-memory checkpoint covers
the current session; the disk KV cache makes useful prefixes survive session
switches and server restarts.
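To make the prefix-reuse idea concrete, here is a minimal sketch in C. The struct and function names are illustrative only, not ds4-server's actual identifiers:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checkpoint view: the token IDs the cached KV state was built from. */
typedef struct {
    const uint32_t *tokens;
    size_t          count;
} kv_checkpoint_t;

/* Length of the longest shared token prefix between the freshly rendered
 * request tokens and a cached checkpoint. Only tokens inside the match can
 * be reused; everything past it must be prefilled normally. */
static size_t shared_prefix_len(const uint32_t *req, size_t req_count,
                                const kv_checkpoint_t *ckpt)
{
    size_t max = req_count < ckpt->count ? req_count : ckpt->count;
    size_t n = 0;
    while (n < max && req[n] == ckpt->tokens[n]) n++;
    return n;
}

/* Reuse is only worthwhile above some minimum prefix length
 * (compare the server's --kv-cache-min-tokens default of 512). */
static int worth_restoring(size_t prefix_len, size_t min_tokens)
{
    return prefix_len >= min_tokens;
}
```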
For RAM reasons there is currently only one live KV cache in memory. When a new unrelated session replaces it, the old checkpoint can only be resumed without re-processing if it was written to the disk KV cache. In other words, memory cache handles the active session; disk cache is the resume mechanism for different sessions.
Enable it with:
./ds4-server --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
The cache key is the SHA1 of exact token IDs, not raw text. Each token ID is
hashed as a little-endian 32-bit integer, and files are named <sha1>.kv.
The file is intentionally written with ordinary read/write I/O, not
mmap, so restoring cache entries does not add more VM mappings to a process
that already maps the model.
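As an illustration of that key derivation, here is one way it could look in C, using OpenSSL's SHA1 purely for the sketch; the engine's own hashing code may be organized differently:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/sha.h>   /* link with -lcrypto */

/* Build "<sha1>.kv" from the exact token IDs, hashing each ID as a
 * little-endian 32-bit integer. `out` must hold at least 44 bytes. */
static void kv_cache_filename(const uint32_t *tokens, size_t count, char *out)
{
    unsigned char *buf = malloc(count * 4);
    if (!buf) { out[0] = '\0'; return; }

    for (size_t i = 0; i < count; i++) {
        buf[i * 4 + 0] = (unsigned char)(tokens[i] & 0xff);
        buf[i * 4 + 1] = (unsigned char)((tokens[i] >> 8) & 0xff);
        buf[i * 4 + 2] = (unsigned char)((tokens[i] >> 16) & 0xff);
        buf[i * 4 + 3] = (unsigned char)((tokens[i] >> 24) & 0xff);
    }

    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1(buf, count * 4, digest);
    free(buf);

    /* Hex digest becomes the cache file name. */
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        sprintf(out + i * 2, "%02x", (unsigned)digest[i]);
    strcpy(out + SHA_DIGEST_LENGTH * 2, ".kv");
}
```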
On disk, a cache file is:
KVC fixed header, 48 bytes
u32 rendered_text_bytes
rendered_text_bytes of UTF-8-ish token text
DS4 session payload, payload_bytes from the KVC header
The fixed header is little-endian:
0 u8[3] magic = "KVC"
3 u8 version = 1
4 u8 routed expert quant bits, currently 2 or 4
5 u8 save reason: 0 unknown, 1 cold, 2 continued, 3 evict, 4 shutdown
6 u8[2] reserved
8 u32 cached token count
12 u32 hit count
16 u32 context size the snapshot was written for
20 u8[4] reserved
24 u64 creation Unix time
32 u64 last-used Unix time
40 u64 DS4 session payload byte count
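The offsets above add up to exactly 48 bytes and match natural C alignment, so the header can be pictured as a plain struct. Field names here are illustrative, not the engine's actual identifiers, and the real loader may read fields individually rather than mapping a struct:

```c
#include <stdint.h>
#include <assert.h>

typedef struct {
    uint8_t  magic[3];        /* "KVC" */
    uint8_t  version;         /* 1 */
    uint8_t  quant_bits;      /* routed expert quant bits, currently 2 or 4 */
    uint8_t  save_reason;     /* 0 unknown, 1 cold, 2 continued, 3 evict, 4 shutdown */
    uint8_t  reserved0[2];
    uint32_t token_count;     /* cached token count */
    uint32_t hit_count;
    uint32_t ctx_size;        /* context size the snapshot was written for */
    uint8_t  reserved1[4];
    uint64_t created_unix;    /* creation Unix time */
    uint64_t last_used_unix;  /* last-used Unix time */
    uint64_t payload_bytes;   /* DS4 session payload byte count */
} kvc_header_t;               /* all fields little-endian on disk */

static_assert(sizeof(kvc_header_t) == 48, "KVC header must be 48 bytes");
```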
The rendered text is the tokenizer-decoded text for the cached token prefix. It is stored only for observability, so humans can inspect a cache directory without decoding token IDs. It is not used as the key and it is not trusted when loading; after load, the stored checkpoint tokens must still match the incoming request prefix.
The DS4 session payload starts with thirteen little-endian u32 fields:
0 magic = "DSV4"
1 payload version = 1
2 saved context size
3 prefill chunk size
4 raw KV ring capacity
5 raw sliding-window length
6 compressed KV capacity
7 checkpoint token count
8 layer count
9 raw/head KV dimension
10 indexer head dimension
11 vocabulary size
12 live raw rows serialized below
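Read back to back, those thirteen values can be viewed as a small header struct of their own (again, the names are illustrative rather than the project's):

```c
#include <stdint.h>

typedef struct {
    uint32_t magic;            /* "DSV4" packed into a u32 */
    uint32_t version;          /* 1 */
    uint32_t ctx_size;         /* saved context size */
    uint32_t prefill_chunk;    /* prefill chunk size */
    uint32_t raw_kv_capacity;  /* raw KV ring capacity */
    uint32_t raw_window_len;   /* raw sliding-window length */
    uint32_t comp_kv_capacity; /* compressed KV capacity */
    uint32_t token_count;      /* checkpoint token count */
    uint32_t layer_count;
    uint32_t raw_kv_dim;       /* raw/head KV dimension */
    uint32_t indexer_head_dim;
    uint32_t vocab_size;
    uint32_t live_raw_rows;    /* live raw rows serialized below */
} ds4_session_header_t;        /* thirteen little-endian u32 fields */
```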
Then it stores:
u32[token_count] checkpoint token IDs
float32[vocab_size] logits for the next token after that checkpoint
u32[layer_count] compressed attention row counts
u32[layer_count] ratio-4 indexer row counts

The logits are raw IEEE-754 float32 values from the host ds4_session
buffer. They are saved immediately after the checkpoint tokens so a loaded
snapshot can sample or continue from the exact next-token distribution without
running one extra decode step. MTP draft logits/state are not persisted; after
loading a disk checkpoint the draft state is invalidated and rebuilt by normal
generation.
The tensor payload is DS4-specific KV/session state, not a generic inference
graph dump. It is expected to be portable only across compatible ds4.c
builds for this model layout.
The cache stores checkpoints at four moments:
cold: after a long first prompt reaches a stable prefix, before generation.
continued: when prefill or generation advances the live conversation by the configured interval.
evict: before an unrelated request replaces the live in-memory session.
shutdown: when the server exits cleanly.

Cold saves intentionally trim a small token suffix and align down to a prefill chunk boundary (see the sketch below). This avoids common BPE boundary retokenization misses when a future request appends text to the same prompt. The defaults are conservative: store prefixes of at least 512 tokens, cold-save prompts up to 30000 tokens, trim 32 tail tokens, and align to 2048-token chunks. The important knobs are:
--kv-cache-min-tokens
--kv-cache-cold-max-tokens
--kv-cache-continued-interval-tokens
--kv-cache-boundary-trim-tokens
--kv-cache-boundary-align-tokens

By default, checkpoints may be reused across the 2-bit and 4-bit routed-expert
variants if the token prefix matches. Use --kv-cache-reject-different-quant
when you want strict same-quant reuse only.
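Here is a minimal sketch of the cold-save arithmetic described above, assuming the default values of those knobs; the real server obviously takes them from the flags rather than hard-coding them:

```c
#include <stddef.h>

/* Decide how many prompt tokens a cold save should checkpoint.
 * Returns 0 when no cold save should happen. */
static size_t cold_save_token_count(size_t prompt_tokens)
{
    const size_t min_tokens  = 512;    /* --kv-cache-min-tokens */
    const size_t cold_max    = 30000;  /* --kv-cache-cold-max-tokens */
    const size_t trim_tokens = 32;     /* --kv-cache-boundary-trim-tokens */
    const size_t align       = 2048;   /* --kv-cache-boundary-align-tokens */

    if (prompt_tokens > cold_max) return 0;      /* too long for a cold save */
    if (prompt_tokens <= trim_tokens) return 0;

    size_t kept = prompt_tokens - trim_tokens;   /* drop the BPE-fragile tail */
    kept -= kept % align;                        /* align down to a prefill chunk boundary */

    return kept >= min_tokens ? kept : 0;        /* respect the minimum prefix */
}
```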
The cache directory is disposable. If behavior looks suspicious, stop the server and remove it. You can inspect what is cached with hexdump, since the KV cache files include the cached prompt verbatim.
The default backend is Metal:
./ds4 -p "Hello" --metal
There is also a CPU reference/debug path:
./ds4 -p "Hello" --cpu
Do not treat the CPU path as the production target. The server is Metal-only, and the optimized implementation lives in the Metal graph path. This may change in the future.
tests/test-vectors contains short and long-context continuation vectors
captured from the official DeepSeek V4 Flash API. The requests use
deepseek-v4-flash, greedy decoding, thinking disabled, and the maximum
top_logprobs slice exposed by the API. Local vectors are generated with
./ds4 --dump-logprobs and compared by token bytes, so tokenizer/template or
attention regressions show up before they become long generation failures.
All project tests are driven by the C runner:
make test # ./ds4_test --all
./ds4_test --logprob-vectors
./ds4_test --server
I'm assuming this is faster, and/or lets you run a bigger, smarter model, than just using the generic toolchain, but it doesn't spell out the size of the improvement over that baseline, or the expected improvement, as far as I can see?
Presumably you can work it out based on the numbers given if you have the relevant comparison values.
> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,
https://www.tomshardware.com/tech-industry/artificial-intell...
Custom code targeting one specific hardware implementation can improve performance quite a bit.
This is also a fine example of a vibe-coded project with purpose, as you acknowledged.
Even if not perfect, if you publish on GH or HF, some other agent can maybe start there and not from zero. I did this for Ling-2.6-flash (107B-A7B4 MoE), the biggest LLM I can run for practical use on the other h/w I got for local LLMs (M2 Max). Even if MTP is not working well, it's still an improvement over current llama.cpp, which does not run Ling-2.6-flash at all. This - https://huggingface.co/inclusionAI/Ling-2.6-flash/discussion.... The 4-bit quants are at https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, the branch is at https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas....
Edit: The caching story makes a lot more sense for regular usage: > Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
I know this is flash, but….
But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and, like, more than 10% of GDP?
Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK
Sidenote: shout out antirez love my redis :)
[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
I think llama.cpp could have done a much better job supporting PCs. Sure, some of it is due to bad vendor support, but with so many users I am surprised we don't see more optimized inference on standard PCs.
### Diagnosing parallelism pathologies (L1)
*Grid occupancy:*
- `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.

*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.

*Router top-k (within-block fix):*
- Kernel: `qwen35_router_select_kernel` @ c=1 decode
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + `__shared__` top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
They've dropped all the Mac Studio configs higher than 96 GB, as well as the base Mac mini. They're also rumored to be considering taking the Neo base config off the market.
This seems to be how they’re dealing with supply constraints for fab capacity and RAM.
Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of the cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.
If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT 5 and Opus which feel maybe comparable-ish around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever I'm burning 500W in a datacenter somewhere during that time?
Are there any architectures that don't rely on feeding the entire history back into the chat?
Recurrent LLMs?
That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this that are outside normal corporate constraints. There aren't many of those.
This is probably far from the raw intelligence provided by cloud providers.
Still, this shines more light on local LLMs for agentic workflows.
Nonetheless, eventually I want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.
edit: Though TIL Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.
https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...
Also, can the engine support transparent mmap use for fetching weights from disk on-demand, at least when using pure CPU? (GPU inference might be harder, since it's not clear how page faults would interact with running a shader.)
If the latter test is successful, next would be testing Macs with more limited RAM, first running simple requests (would be quite slow) then larger batches (might be more worthwhile if one can partially amortize the cost of fetching weights from storage, and be bottlenecked by other factors).
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
An example is https://blog.can.ac/2026/02/12/the-harness-problem/ for just improving edits.
Or, if we could really steer these open-source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (a la the night shift https://jamon.dev/night-shift)?
They said the same thing about open source chess engines.
48 GB is enough for a capable LLM.
Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
- Consumers of LLM inference (developers and hobbyists) will be more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and be incentivized to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)
- A larger market for on-device (and therefore open weight) LLM's will probably result in more research concentrated on those inherently more efficient (because compute/memory-constrained) models.
I think that despite the inefficiencies, shifting the market towards local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but it is still much less than what, say, a PS5 draws.
Also remember how AWS had the same promise and now we're just deploying stack after stack and need 'FinOps' teams to get us to be more resource-efficient?
I could write an engine that only uses 10W on your machine, but it wouldn't be meaningful if it was also 10X slower.
More power consumption is usually an indicator that the hardware is being fully utilized, all things equal (comparing GPU to GPU or CPU to CPU, not apples to oranges)
Claude Sonnet is probably running on a 8 GPU box that consumes 10 kW while Opus might use more like 50 kW but that's shared by a bunch of users thanks to batching.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
Data center energy use isn't simple to calculate because servers are configured to process a lot of requests in parallel. You're not getting an entire GPU cluster to yourself while your request is being processed. Your tokens are being processed in parallel with a lot of other people's requests for efficiency.
This is why some providers can offer a fast mode: Your request gets routed to servers that are tuned to process fewer requests in parallel for a moderate speedup. They charge you more for it because they can't fit as many requests into that server.
But you’re right I agree
In the corporate world they sadly don’t take kindly to performance profiling as a first class citizen
Granted I will say optimization without requirements may not be beneficial but at least profiling itself seems worthy if you have use cases.
A lot of us have been working in the network packet pusher software, distributed systems, and distributed storage space.
I’m happy to see more stuff like this :)
TL;DR: I've not seen a lot of flamegraphs of LLMs end to end… idk if anyone else has?
Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize usage, individuals and their devices don't have to.
A Mac is also the rest of the personal computer!
Not really. That's going to land you somewhere in the 0.2-0.5 tokens per second range.
Lovely as modern NVMe drives are, they're not memory.
They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPU's that need to be constantly replaced, while your phone is full of mostly unused compute capacity and will barely break a sweat if it's only serving queries for a single user at a time.
They don't seem to give much thought to the energy usage per user (or what this will potentially do to your phone battery), or how different phone-sized vs data center-sized models are in terms of capability.
[1] https://finance.yahoo.com/sectors/technology/articles/nvidia...
Optimizing things usually means "think of a way to do the same thing with less effort".
The trend is heading in the opposite direction: fewer options for strong consumer hardware, and a push towards cloud-based products. This is a memory issue more than anything. Nvidia is done selling their GDDR7 to gamers and people with AI girlfriends.
It's not out of the realm of possibility, but I just want to make you aware that this would be a very surprising development in computing history.
> in the next few years a "good enough" model will run on entry-level hardware
But my main argument is that it's delusional for OP to think it's reasonable to expect that soon we'll be able to run models on consumer hardware that can build basically most things.
But I do think there will be many compromises made for consumer electronics; I don't think the powers that be are eager to give consumers all the best memory (that should be clear by now). There are three DDR5 DRAM manufacturers in the world that have to provide memory to all the world's militaries, governments, and datacenters/corporations. Consumers are the last priority.
And that's for laptops with unified memory. In the desktop space, 8GB discrete GPUs are going to be sticking around for a very long time.