Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent which tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs (123 FP16 TFLOPS / INT8 TOPS, 864 GB/s memory bandwidth), but it has had notoriously bad support both from AMD (ROCm) and from llama.cpp.
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, so I started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
The inference engines in use already include different backend building blocks optimized for different hardware.
While there are places where you can pick up some low hanging fruit for less popular platforms, there isn't a lot of room to squeeze in super optimized model-runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.
There are forks of llama.cpp that have better optimizations for running on CPU architectures, but (barring maintainer disagreements) a better use of time is to target merging these improvements upstream instead of trying to make super specific model+GPU runners.
ds4.c is a small native inference engine for DeepSeek V4 Flash. It is
intentionally narrow: not a generic GGUF runner, not a wrapper around another
runtime, and not a framework. The main path is a DeepSeek V4 Flash-specific
Metal graph executor with DS4-specific loading, prompt rendering, KV state, and
server API glue.
This project would not exist without llama.cpp and GGML; make sure to read the acknowledgements section. A big thank you to Georgi Gerganov and all the other contributors.
Now, back to this project. Why do we believe DeepSeek V4 Flash is a pretty special model, deserving a standalone engine? Because after comparing it with powerful smaller dense models, we can report that:
That said, a few important things about this project: it owes a lot to
llama.cpp and GGML, but is largely written by hand. ds4.c does not link against GGML, but it exists thanks to the path opened by the
llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won
engineering knowledge developed there.
We are thankful and indebted to llama.cpp
and its contributors. Their implementation, kernels, tests, and design choices were
an essential reference while building this DeepSeek V4 Flash-specific inference path.
Some source-level pieces are retained or adapted here under the MIT license: GGUF
quant layouts and tables, CPU quant/dot logic, and certain Metal kernels. For this
reason, and because we are genuinely grateful, we keep the GGML authors' copyright
notice in our LICENSE file.
This implementation only works with the DeepSeek V4 Flash GGUFs published for
this project. It is not a general GGUF loader, and arbitrary DeepSeek/GGUF files
will not have the tensor layout, quantization mix, metadata, or optional MTP
state expected by the engine. The 2 bit quantizations provided here are not
a joke: they behave well, work under coding agents, and call tools reliably.
The 2 bit quants use a very asymmetrical quantization: only the routed MoE
experts are quantized, up/gate at IQ2_XXS, down at Q2_K. The routed experts
account for the majority of the model's weights; the other components (shared
experts, projections, routing) are left untouched to guarantee quality.
Download one main model:
./download_model.sh q2 # 128 GB RAM machines
./download_model.sh q4 # >= 256 GB RAM machines
The script downloads from https://huggingface.co/antirez/deepseek-v4-gguf,
stores files under ./gguf/, resumes partial downloads with curl -C -, and
updates ./ds4flash.gguf to point at the selected q2/q4 model. Authentication
is optional for public downloads, but --token TOKEN, HF_TOKEN, or the local
Hugging Face token cache are used when present.
./download_model.sh mtp fetches the optional speculative decoding support
GGUF. It can be used with both q2 and q4, but must be enabled explicitly with
--mtp. The current MTP/speculative decoding path is still experimental: it is
correctness-gated and currently provides at most a slight speedup, not a
meaningful generation-speed win.
Then build:
make
./ds4flash.gguf is the default model path used by both binaries. Pass -m to
select another supported GGUF from ./gguf/. Run ./ds4 --help and
./ds4-server --help for the full flag list.
These are single-run Metal CLI numbers with --ctx 32768, --nothink, greedy
decoding, and -n 256. The short prompt is a normal small Italian story
prompt. The long prompts exercise chunked prefill plus long-context decode.
Q4 requires the larger-memory machine class, so M3 Max Q4 numbers are N/A.
| Machine | Quant | Prompt | Prefill | Generation |
|---|---|---|---|---|
| MacBook Pro M3 Max, 128 GB | q2 | short | 58.52 t/s | 26.68 t/s |
| MacBook Pro M3 Max, 128 GB | q2 | 11709 tokens | 250.11 t/s | 21.47 t/s |
| MacBook Pro M3 Max, 128 GB | q4 | short | N/A | N/A |
| MacBook Pro M3 Max, 128 GB | q4 | long | N/A | N/A |
| Mac Studio M3 Ultra, 512 GB | q2 | short | 84.43 t/s | 36.86 t/s |
| Mac Studio M3 Ultra, 512 GB | q2 | 11709 tokens | 468.03 t/s | 27.39 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | short | 78.95 t/s | 35.50 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | 12018 tokens | 448.82 t/s | 26.62 t/s |
One-shot prompt:
./ds4 -p "Explain Redis streams in one paragraph."
Running without -p starts the interactive prompt:
./ds4
ds4>
The interactive CLI is a real multi-turn DS4 chat. It keeps the rendered chat
transcript and the live Metal KV checkpoint, so each turn extends the previous
conversation. Useful commands are /help, /think, /think-max, /nothink,
/ctx N, /read FILE, and /quit. Ctrl+C interrupts the current generation
and returns to ds4>.
The CLI defaults to thinking mode. Use /nothink or --nothink for direct
answers. --mtp MTP.gguf --mtp-draft 2 enables the optional MTP speculative
path; it is useful only for greedy decoding, currently uses a confidence gate
(--mtp-margin) to avoid slow partial accepts, and should be treated as an
experimental slight-speedup path.
Start a local OpenAI/Anthropic-compatible server:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
The server is Metal-only. It keeps one mutable graph/KV checkpoint in memory, so stateless clients that resend a longer version of the same prompt can reuse the shared prefix instead of pre-filling from token zero.
Request parsing and sockets run in client threads, but inference itself is serialized through one Metal worker. The current server does not batch multiple independent requests together; concurrent requests wait their turn on the single live graph/session.
Supported endpoints:
GET /v1/models
GET /v1/models/deepseek-v4-flash
POST /v1/chat/completions
POST /v1/completions
POST /v1/messages

/v1/chat/completions accepts the usual OpenAI-style messages,
max_tokens/max_completion_tokens, temperature, top_p, top_k, min_p,
seed, stream, stream_options.include_usage, tools, and tool_choice.
Tool schemas are rendered into DeepSeek's DSML tool format, and generated DSML
tool calls are mapped back to OpenAI tool calls.
/v1/messages is the Anthropic-compatible endpoint used by Claude Code style
clients. It accepts system, messages, tools, tool_choice, max_tokens,
temperature, top_p, top_k, stream, stop_sequences, and thinking
controls. Tool uses are returned as Anthropic tool_use blocks.
Both APIs support SSE streaming. In thinking mode, reasoning is streamed in the native API shape instead of being mixed into final text.
Minimal OpenAI example:
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"deepseek-v4-flash",
"messages":[{"role":"user","content":"List three Redis design principles."}],
"stream":true
}'
ds4-server can be used by local coding agents that speak OpenAI-compatible
chat completions. Start the server first, and set the client context limit no
higher than the --ctx value you started the server with:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
You can use a larger context and a larger cache if you wish. A full context of 1M tokens will use roughly 26GB of memory (the compressed indexer alone is around 22GB), so configure a context that makes sense for your system. With 128GB of RAM you would run the 2-bit quants, which are already 81GB; an extra 26GB is likely too much, so a context window of 100~300k tokens is wiser.
The 384000 output limit below avoids token caps since the model is able
to generate very long replies otherwise (up to 384k tokens). The server
still stops when the configured context window is full.
For opencode, add a provider and agent entry to
~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ds4": {
"name": "ds4.c (local)",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://127.0.0.1:8000/v1",
"apiKey": "dsv4-local"
},
"models": {
"deepseek-v4-flash": {
"name": "DeepSeek V4 Flash (ds4.c local)",
"limit": {
"context": 100000,
"output": 384000
}
}
}
}
},
"agent": {
"ds4": {
"description": "DeepSeek V4 Flash served by local ds4-server",
"model": "ds4/deepseek-v4-flash",
"temperature": 0
}
}
}
For Pi, add a provider to ~/.pi/agent/models.json:
{
"providers": {
"ds4": {
"name": "ds4.c local",
"baseUrl": "http://127.0.0.1:8000/v1",
"api": "openai-completions",
"apiKey": "dsv4-local",
"compat": {
"supportsStore": false,
"supportsDeveloperRole": false,
"supportsReasoningEffort": true,
"supportsUsageInStreaming": true,
"maxTokensField": "max_tokens",
"supportsStrictMode": false,
"thinkingFormat": "deepseek",
"requiresReasoningContentOnAssistantMessages": true
},
"models": [
{
"id": "deepseek-v4-flash",
"name": "DeepSeek V4 Flash (ds4.c local)",
"reasoning": true,
"thinkingLevelMap": {
"off": null,
"minimal": "low",
"low": "low",
"medium": "medium",
"high": "high",
"xhigh": "xhigh"
},
"input": ["text"],
"contextWindow": 100000,
"maxTokens": 384000,
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
}
}
]
}
}
}
Optionally make it the default Pi model in ~/.pi/agent/settings.json:
{
"defaultProvider": "ds4",
"defaultModel": "deepseek-v4-flash"
}
For Claude Code, use the Anthropic-compatible endpoint. A wrapper like this
matches the local ~/bin/claude-ds4 setup:
#!/bin/sh
unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="${DS4_ANTHROPIC_BASE_URL:-http://127.0.0.1:8000}"
export ANTHROPIC_AUTH_TOKEN="${DS4_API_KEY:-dsv4-local}"
export ANTHROPIC_MODEL="deepseek-v4-flash"
export ANTHROPIC_CUSTOM_MODEL_OPTION="deepseek-v4-flash"
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="DeepSeek V4 Flash local ds4"
export ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION="ds4.c local GGUF"
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_SUBAGENT_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_DISABLE_NONSTREAMING_FALLBACK=1
export CLAUDE_STREAM_IDLE_TIMEOUT_MS=600000
exec "$HOME/.local/bin/claude" "$@"
Claude Code may send a large initial prompt, often around 25k tokens, before it
starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive
prefill, the disk KV cache lets later continuations or restarted sessions reuse
the saved prefix instead of processing the whole prompt again.
DeepSeek V4 Flash has distinct non-thinking, thinking, and Think Max modes.
The server defaults to thinking mode. reasoning_effort=max requests Think
Max, but it is only applied when the context size is large enough for the model
card recommendation; smaller contexts fall back to normal thinking. OpenAI
reasoning_effort=xhigh still maps to normal thinking, not Think Max.
For direct replies, use thinking: {"type":"disabled"}, think:false, or a
non-thinking model alias such as deepseek-chat.
Chat/completion APIs are stateless: agent clients usually resend the whole
conversation every request. ds4-server handles this by comparing the rendered
token stream with cached token prefixes. The live in-memory checkpoint covers
the current session; the disk KV cache makes useful prefixes survive session
switches and server restarts.
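To make the prefix-reuse idea concrete, here is a minimal sketch in C. The struct and function names are illustrative only, not ds4-server's actual identifiers:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checkpoint view: the token IDs the cached KV state was built from. */
typedef struct {
    const uint32_t *tokens;
    size_t          count;
} kv_checkpoint_t;

/* Length of the longest shared token prefix between the freshly rendered
 * request tokens and a cached checkpoint. Only tokens inside the match can
 * be reused; everything past it must be prefilled normally. */
static size_t shared_prefix_len(const uint32_t *req, size_t req_count,
                                const kv_checkpoint_t *ckpt)
{
    size_t max = req_count < ckpt->count ? req_count : ckpt->count;
    size_t n = 0;
    while (n < max && req[n] == ckpt->tokens[n]) n++;
    return n;
}

/* Reuse is only worthwhile above some minimum prefix length
 * (compare the server's --kv-cache-min-tokens default of 512). */
static int worth_restoring(size_t prefix_len, size_t min_tokens)
{
    return prefix_len >= min_tokens;
}
```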
For RAM reasons there is currently only one live KV cache in memory. When a new unrelated session replaces it, the old checkpoint can only be resumed without re-processing if it was written to the disk KV cache. In other words, memory cache handles the active session; disk cache is the resume mechanism for different sessions.
Enable it with:
./ds4-server --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
The cache key is the SHA1 of exact token IDs, not raw text. Each token ID is
hashed as a little-endian 32-bit integer, and files are named <sha1>.kv.
The file is intentionally written with ordinary read/write I/O, not
mmap, so restoring cache entries does not add more VM mappings to a process
that already maps the model.
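As an illustration of that key derivation, here is one way it could look in C, using OpenSSL's SHA1 purely for the sketch; the engine's own hashing code may be organized differently:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/sha.h>   /* link with -lcrypto */

/* Build "<sha1>.kv" from the exact token IDs, hashing each ID as a
 * little-endian 32-bit integer. `out` must hold at least 44 bytes. */
static void kv_cache_filename(const uint32_t *tokens, size_t count, char *out)
{
    unsigned char *buf = malloc(count * 4);
    if (!buf) { out[0] = '\0'; return; }

    for (size_t i = 0; i < count; i++) {
        buf[i * 4 + 0] = (unsigned char)(tokens[i] & 0xff);
        buf[i * 4 + 1] = (unsigned char)((tokens[i] >> 8) & 0xff);
        buf[i * 4 + 2] = (unsigned char)((tokens[i] >> 16) & 0xff);
        buf[i * 4 + 3] = (unsigned char)((tokens[i] >> 24) & 0xff);
    }

    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1(buf, count * 4, digest);
    free(buf);

    /* Hex digest becomes the cache file name. */
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        sprintf(out + i * 2, "%02x", (unsigned)digest[i]);
    strcpy(out + SHA_DIGEST_LENGTH * 2, ".kv");
}
```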
On disk, a cache file is:
KVC fixed header, 48 bytes
u32 rendered_text_bytes
rendered_text_bytes of UTF-8-ish token text
DS4 session payload, payload_bytes from the KVC header
The fixed header is little-endian:
0 u8[3] magic = "KVC"
3 u8 version = 1
4 u8 routed expert quant bits, currently 2 or 4
5 u8 save reason: 0 unknown, 1 cold, 2 continued, 3 evict, 4 shutdown
6 u8[2] reserved
8 u32 cached token count
12 u32 hit count
16 u32 context size the snapshot was written for
20 u8[4] reserved
24 u64 creation Unix time
32 u64 last-used Unix time
40 u64 DS4 session payload byte count
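The offsets above add up to exactly 48 bytes and match natural C alignment, so the header can be pictured as a plain struct. Field names here are illustrative, not the engine's actual identifiers, and the real loader may read fields individually rather than mapping a struct:

```c
#include <stdint.h>
#include <assert.h>

typedef struct {
    uint8_t  magic[3];        /* "KVC" */
    uint8_t  version;         /* 1 */
    uint8_t  quant_bits;      /* routed expert quant bits, currently 2 or 4 */
    uint8_t  save_reason;     /* 0 unknown, 1 cold, 2 continued, 3 evict, 4 shutdown */
    uint8_t  reserved0[2];
    uint32_t token_count;     /* cached token count */
    uint32_t hit_count;
    uint32_t ctx_size;        /* context size the snapshot was written for */
    uint8_t  reserved1[4];
    uint64_t created_unix;    /* creation Unix time */
    uint64_t last_used_unix;  /* last-used Unix time */
    uint64_t payload_bytes;   /* DS4 session payload byte count */
} kvc_header_t;               /* all fields little-endian on disk */

static_assert(sizeof(kvc_header_t) == 48, "KVC header must be 48 bytes");
```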
The rendered text is the tokenizer-decoded text for the cached token prefix. It is stored only for observability, so humans can inspect a cache directory without decoding token IDs. It is not used as the key and it is not trusted when loading; after load, the stored checkpoint tokens must still match the incoming request prefix.
The DS4 session payload starts with thirteen little-endian u32 fields:
0 magic = "DSV4"
1 payload version = 1
2 saved context size
3 prefill chunk size
4 raw KV ring capacity
5 raw sliding-window length
6 compressed KV capacity
7 checkpoint token count
8 layer count
9 raw/head KV dimension
10 indexer head dimension
11 vocabulary size
12 live raw rows serialized below
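Read back to back, those thirteen values can be viewed as a small header struct of their own (again, the names are illustrative rather than the project's):

```c
#include <stdint.h>

typedef struct {
    uint32_t magic;            /* "DSV4" packed into a u32 */
    uint32_t version;          /* 1 */
    uint32_t ctx_size;         /* saved context size */
    uint32_t prefill_chunk;    /* prefill chunk size */
    uint32_t raw_kv_capacity;  /* raw KV ring capacity */
    uint32_t raw_window_len;   /* raw sliding-window length */
    uint32_t comp_kv_capacity; /* compressed KV capacity */
    uint32_t token_count;      /* checkpoint token count */
    uint32_t layer_count;
    uint32_t raw_kv_dim;       /* raw/head KV dimension */
    uint32_t indexer_head_dim;
    uint32_t vocab_size;
    uint32_t live_raw_rows;    /* live raw rows serialized below */
} ds4_session_header_t;        /* thirteen little-endian u32 fields */
```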
Then it stores:
u32[token_count] checkpoint token IDs
float32[vocab_size] logits for the next token after that checkpoint
u32[layer_count] compressed attention row counts
u32[layer_count] ratio-4 indexer row counts

The logits are raw IEEE-754 float32 values from the host ds4_session
buffer. They are saved immediately after the checkpoint tokens so a loaded
snapshot can sample or continue from the exact next-token distribution without
running one extra decode step. MTP draft logits/state are not persisted; after
loading a disk checkpoint the draft state is invalidated and rebuilt by normal
generation.
The tensor payload is DS4-specific KV/session state, not a generic inference
graph dump. It is expected to be portable only across compatible ds4.c
builds for this model layout.
The cache stores checkpoints at four moments:
cold: after a long first prompt reaches a stable prefix, before generation.
continued: when prefill or generation advances the live conversation by the configured interval.
evict: before an unrelated request replaces the live in-memory session.
shutdown: when the server exits cleanly.

Cold saves intentionally trim a small token suffix and align down to a prefill chunk boundary (see the sketch below). This avoids common BPE boundary retokenization misses when a future request appends text to the same prompt. The defaults are conservative: store prefixes of at least 512 tokens, cold-save prompts up to 30000 tokens, trim 32 tail tokens, and align to 2048-token chunks. The important knobs are:
--kv-cache-min-tokens
--kv-cache-cold-max-tokens
--kv-cache-continued-interval-tokens
--kv-cache-boundary-trim-tokens
--kv-cache-boundary-align-tokens

By default, checkpoints may be reused across the 2-bit and 4-bit routed-expert
variants if the token prefix matches. Use --kv-cache-reject-different-quant
when you want strict same-quant reuse only.
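Here is a minimal sketch of the cold-save arithmetic described above, assuming the default values of those knobs; the real server obviously takes them from the flags rather than hard-coding them:

```c
#include <stddef.h>

/* Decide how many prompt tokens a cold save should checkpoint.
 * Returns 0 when no cold save should happen. */
static size_t cold_save_token_count(size_t prompt_tokens)
{
    const size_t min_tokens  = 512;    /* --kv-cache-min-tokens */
    const size_t cold_max    = 30000;  /* --kv-cache-cold-max-tokens */
    const size_t trim_tokens = 32;     /* --kv-cache-boundary-trim-tokens */
    const size_t align       = 2048;   /* --kv-cache-boundary-align-tokens */

    if (prompt_tokens > cold_max) return 0;      /* too long for a cold save */
    if (prompt_tokens <= trim_tokens) return 0;

    size_t kept = prompt_tokens - trim_tokens;   /* drop the BPE-fragile tail */
    kept -= kept % align;                        /* align down to a prefill chunk boundary */

    return kept >= min_tokens ? kept : 0;        /* respect the minimum prefix */
}
```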
The cache directory is disposable. If behavior looks suspicious, stop the server and remove it. You can inspect what is cached with hexdump, since the KV cache files include the cached prompt verbatim.
The default backend is Metal:
./ds4 -p "Hello" --metal
There is also a CPU reference/debug path:
./ds4 -p "Hello" --cpu
Do not treat the CPU path as the production target. The server is Metal-only, and the optimized implementation lives in the Metal graph path. This may change in the future.
tests/test-vectors contains short and long-context continuation vectors
captured from the official DeepSeek V4 Flash API. The requests use
deepseek-v4-flash, greedy decoding, thinking disabled, and the maximum
top_logprobs slice exposed by the API. Local vectors are generated with
./ds4 --dump-logprobs and compared by token bytes, so tokenizer/template or
attention regressions show up before they become long generation failures.
All project tests are driven by the C runner:
make test # ./ds4_test --all
./ds4_test --logprob-vectors
./ds4_test --server
I'm assuming this is faster, and/or lets you run a bigger, smarter model, than just using the generic toolchain, but it doesn't spell out the size of the improvement over that baseline, or the expected improvement, as far as I can see?
Presumably you can work it out based on the numbers given if you have the relevant comparison values.
> DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions,
https://www.tomshardware.com/tech-industry/artificial-intell...
Custom code targeting one specific hardware implementation can improve performance quite a bit.
This is also a fine example of a vibe-coded project with purpose, as you acknowledged.
Even if not perfect, if you publish on GH or HF, some other agent can maybe start there and not from zero. I did this for Ling-2.6-flash (107B-A7B4 MoE), the biggest LLM I can run for practical use on the other h/w I got for local LLMs (M2 Max). Even if MTP is not working well, it's still an improvement over current llama.cpp, which does not run Ling-2.6-flash at all. This - https://huggingface.co/inclusionAI/Ling-2.6-flash/discussion.... The 4-bit quants are at https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, the branch is at https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas....
Edit: The caching story makes a lot more sense for regular usage: > Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
I know this is flash, but….
But other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and, like, more than 10% of GDP?
Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK
Sidenote: shout out antirez love my redis :)
[1] https://codegolf.stackexchange.com/questions/215216/high-thr...
I think llama.cpp could have done a much better job supporting PCs. Sure, some of it is due to bad vendor support, but with so many users I am surprised we don't see more optimized inference on standard PCs.
### Diagnosing parallelism pathologies (L1)
*Grid occupancy:*
- `Grid_Size / Workgroup_Size >= CU count` (W7900 = 96, Strix Halo = 40)?
- < 0.3 = massively undersubscribed. Fix grid FIRST. Micro-optimization will NOT help.
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.

*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an `if (threadIdx.x == 0)` gate around a serial top-k, reduction, or scan? For c=1 decode, many kernels can't grow the grid, but they can always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a strong secondary signal of the within-block pathology.

*Router top-k (within-block fix):*
- Kernel: `qwen35_router_select_kernel` @ c=1 decode
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)` gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + `__shared__` top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
They've dropped all the Mac Studio configs higher than 96 GB, as well as the base Mac mini. They're also rumored to be considering taking the Neo base config off the market.
This seems to be how they’re dealing with supply constraints for fab capacity and RAM.
Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of the cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.
If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT 5 and Opus which feel maybe comparable-ish around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever I'm burning 500W in a datacenter somewhere during that time?
Are there any architectures that don't rely on feeding the entire history back into the chat?
Recurrent LLMs?
That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this that are outside normal corporate constraints. There aren't many of those.
This is probably far from the raw intelligence provided by cloud providers.
Still, this shines more light on local LLMs for agentic workflows.
Nonetheless, eventually I want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.
edit: Though TIL Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.
https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...
Also, can the engine support transparent mmap use for fetching weights from disk on-demand, at least when using pure CPU? (GPU inference might be harder, since it's not clear how page faults would interact with running a shader.)
If the latter test is successful, next would be testing Macs with more limited RAM, first running simple requests (would be quite slow) then larger batches (might be more worthwhile if one can partially amortize the cost of fetching weights from storage, and be bottlenecked by other factors).
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
An example is https://blog.can.ac/2026/02/12/the-harness-problem/ for just improving edits.
Or, if we could really steer these open-source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (a la the night shift https://jamon.dev/night-shift)?
They said the same thing about open source chess engines.
48 GB is enough for a capable LLM.
Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
- Consumers of LLM inference (developers and hobbyists) will be more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and be incentivized to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)
- A larger market for on-device (and therefore open weight) LLM's will probably result in more research concentrated on those inherently more efficient (because compute/memory-constrained) models.
I think that despite the inefficiencies, shifting the market towards local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but it is still much less than what, say, a PS5 draws.
Also remember how AWS had the same promise and now we're just deploying stack after stack and need 'FinOps' teams to get us to be more resource-efficient?
I could write an engine that only uses 10W on your machine, but it wouldn't be meaningful if it was also 10X slower.
More power consumption is usually an indicator that the hardware is being fully utilized, all things equal (comparing GPU to GPU or CPU to CPU, not apples to oranges)
Claude Sonnet is probably running on a 8 GPU box that consumes 10 kW while Opus might use more like 50 kW but that's shared by a bunch of users thanks to batching.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
Data center energy use isn't simple to calculate because servers are configured to process a lot of requests in parallel. You're not getting an entire GPU cluster to yourself while your request is being processed. Your tokens are being processed in parallel with a lot of other people's requests for efficiency.
This is why some providers can offer a fast mode: Your request gets routed to servers that are tuned to process fewer requests in parallel for a moderate speedup. They charge you more for it because they can't fit as many requests into that server.
But you’re right I agree
In the corporate world they sadly don’t take kindly to performance profiling as a first class citizen
Granted I will say optimization without requirements may not be beneficial but at least profiling itself seems worthy if you have use cases.
A lot of us have been working in the network packet pusher software, distributed systems, and distributed storage space.
I’m happy to see more stuff like this :)
TL;DR: I've not seen a lot of flamegraphs of LLMs end to end… idk if anyone else has?
Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize usage, individuals and their devices don't have to.
A Mac is also the rest of the personal computer!
Not really. That's going to land you somewhere in the 0.2-0.5 tokens per second range.
Lovely as modern NVMe drives are, they're not memory.
They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPU's that need to be constantly replaced, while your phone is full of mostly unused compute capacity and will barely break a sweat if it's only serving queries for a single user at a time.
They don't seem to give much thought to the energy usage per user (or what this will potentially do to your phone battery), or how different phone-sized vs data center-sized models are in terms of capability.
[1] https://finance.yahoo.com/sectors/technology/articles/nvidia...
Optimizing things usually means "think of a way to do the same thing with less effort".
The trend is heading in the opposite direction: fewer options for strong consumer hardware, and a push towards cloud-based products. This is a memory issue more than anything. Nvidia is done selling their GDDR7 to gamers and people with AI girlfriends.
It's not out of the realm of possibility, but I just want to make you aware that this would be a very surprising development in computing history.
> in the next few years a "good enough" model will run on entry-level hardware
But my main argument is that it's delusional for OP to think it's reasonable to expect that soon we'll be able to run models on consumer hardware that can build basically most things.
But I do think there will be many compromises made for consumer electronics; I don't think the powers that be are eager to give consumers all the best memory (that should be clear by now). There are three DDR5 DRAM manufacturers in the world that have to provide memory to all the world's militaries, governments, and datacenters/corporations. Consumers are the last priority.
And that's for laptops with unified memory. In the desktop space, 8GB discrete GPUs are going to be sticking around for a very long time.