Two Qwen3 models on one DGX Spark: the residency math

I started with antirez' DwarfStar[1] on one spark and that (~11-14tok/s generation, ~300-400 tok/s prompt processing) was enough of a taste for me to jump into 2 sparks, running the native quant of DSv4 Flash.

Now at 40-50tok/s generation and ~2000 tok/s prefill with a model that I've seen reason through race conditions and be able to trivially pull off any straight-forward coding task, and remain coherent at 500k context. With a preview checkpoint of the weights!

I'm excited for the future of local LLMs. There is some buy-in but apparently not an extreme amount to get access to models that can stand in for the giants on all but the most challenging and/or hands-off coding tasks.

[1]: https://github.com/antirez/ds4

I’ve been considering a move to a local LLM setup, having been underwhelmed by the cost vs. value of various online offerings. But at the same time I’m worried anything I get will be obsolete in a couple of months. And I don’t want to have to babysit it. I really want some agents managing and creating side hustles for me and have some other things. I’m technical: I have written my own harness, use gh copilot and grok daily, and have a hosted openwebui+openrouter thing. I’m also torn between a 128g MacBook Pro or a framework, or a spark or similar plus a lightweight laptop to access it. Would love any advice anyone has for (or against) going local. I have asked AI but have analysis paralysis, as 5k would be a big investment for me, so I want to make the right choices.

Have you tried llama.cpp with unsloth and models suited to it? GLM flash? It seemed to allow more models to be tried soon after they are released. Haven’t tried for long term deployment though, that’s the next step.

2 LLMs at the same time? I've always wanted to do that

How about Qwen3.6? What sort of prefill/decode rates?

Edit: 3.6 not 3.7!

Author here. Quick context the post doesn't quite spell out:

The tool_choice="auto" failure on Qwen3-Next isn't a parser issue — the model reasons inside <think>, decides, and never emits the tool call. No error, just empty tool_calls. The fix was swapping the backbone from Thinking to Instruct, not tuning any parser flag.

The "load the bigger model first, size the smaller against actual residency" playbook generalizes to anything with shared CUDA framework overhead. The ~5 GiB framework floor shows up even at small gpu_memory_utilization values — plan against actuals, not targets.

> Now at 40-50tok/s generation and ~2000 tok/s

Not clear how you went from ~11-14 to ~40-50 tok/s. Is it by running the quant native model and adding a second Spark?

Cheers

Just chiming in - the claims above are real, I have very similar numbers in a cluster of 2x GX10 I have access to.

Instructions to reproduce, and benchmarks here: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...

I suspect DwarfStar could probably squeeze more performance out of the single spark, maybe up closer to 20tok/s.

Moving to 2 sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction. The parallelism and MTP on top of better tuned kernels[1] gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60tok/s at ~150k context - sometimes the MTP seems to really kick in (i.e. high acceptance rate on its tokens)

Currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.

[1]: https://github.com/lukealonso/b12x

[2]: https://forums.developer.nvidia.com/t/372268

Well, if you are making side-hustle money now using online models that, critically, you could also run at home, then it sounds like it’s just a matter of numbers. Oh and, unless you spend a lot more than 5k, your local model will still be slower than the online model. What’s your estimated ROI?

Assuming that’s not true based on your phrasing, you’d be shooting yourself in the foot. Start using online models with the same quant at least benchmark as what you could run at home. Prepare for the at home model to be slower.

Mac, DGX Spark, and a Framework Desktop / Ryzen AI Max 395 (ie Strix Halo) will not give you great performance running LLMs. One benefit of the Spark over the others is you can easily link up to 4 of them. Only MoE (sparse) models will be usable. Even if you can run some massive models, they will crawl. You're better off running one or more GPU cards.

You probably want to try renting some time on a dedicated box with roughly the specs you’re considering and running the open models for a bit to see if you would actually use them before dropping a lot on local hardware. A 128 gig MacBook Pro isn’t going to get you an amazing model, and certainly not amazing speed. GLM 5.2 wants something like 350+ gigs at fp4 iirc.

Highly anecdotal: I have tried various self-hosted models using both vllm and llama.cpp. I am in a situation where I have access to a large amount of memory (~320 GB).

While experimenting with quantization I found that there is a non-trivial tradeoff between quality and memory footprint. Overall my experience follows the reported pattern of "2-bit is mwah, 4-bit half decent and 6-bit required for programming". Still, although MiniMax-m2.7 is useable with the 6-bit quantizations that unsloth provides, it felt like such a breath of fresh air when I used the reference full-size model.

I find it difficult to say why. I had mostly the same setup as before (parsing had to be slightly adjusted in Zed). Aside from not experiencing the thinking loops (where minimax would get stuck generating the same sentences over and over) there is little evidence of any real improvement (although the average thinking time felt shorter).

I would recommend against very low quantizations of GLM 5.0/5.1/5.2 or Kimi 2.5/2.6. Smaller models were more reliable, and therefore more useful.

I have tried llama-cpp, vllm is nicer (ray, handles queueing, doesn't have the cache invalidation bug for qwen/gemma models) and unsloth has toxic employees in their discord.

I've run 2 qwen/gemma @8bit with full context window side-by-side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.

The setup is not as capable, but still good and gets better with models/algos. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and potential right to compute should the government/oligarchy decides it gets to decide who can access which models.

So far there aren't any open weight model releases for the Qwen 3.7 family.

From the Codex system prompt (verbatim):

```

(...) - Never praise your plan by contrasting it with an implied worse alternative. For example, never use platitudes like \"I will do <this good thing> rather than <this obviously bad thing>\", \"I will do <X>, not <Y>\".

- Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query. (...)

```

It seems the OpenAI people added that first bullet to specifically address the tendency the model has, as seen in the parent comment. The goblin stuff coincidentally appears right after in the system prompt, so I included it as a bonus.

Can you try and tune your Claude or whatever LLM you're using for your text to phrase things in plain English. Way less use of antithesis, at least. You can probably find a skill for it, if not get an LLM to write your own.

no one is making money side-hustling ai models. This is like reddit wet dream. get real, dont get scammed by ppl selling you these dreams.

I only have access to 96GB VRAM locally, but I'd agree with the general approach of avoiding lower quantizations, often anything below Q8 seems to suffer greatly on quality and seemingly never worth going below it, better to go for smaller model in that case.

With the exception of DwarfStar + DS4-Flash with IQ2_XXS quantization, which somehow seems to not suffer as much as I'd thought. I'd still opt for a smaller model + at least Q8.

> unsloth has toxic employees in their discord

Would you mind elaborating on this?

> So far

Someone's optimistic

I ran glm 5.2 on rented 8x h200 it could only do 2x concurrency at a cost of $40 an hour. It felt great but dang I wish it was cheaper... It needs 750 at fp8

> You probably want to try renting some time on a dedicated box with roughly the specs you’re considering and running the open models

You don't even need to go that far. For example, with Exoscale Dedicated Inference[1] you just point it at the Hugging Face link for the model and quantisation you want to test and it automagically spits out an OpenAI-compatible API endpoint.

[1] https://www.exoscale.com/ai-cloud-infrastructure/dedicated-i...

(I have no relationship with Exoscale, this particular product just crossed my radar recently)

Personally, I've tried to squeeze more tok/s for a single DGX Spark deployment and DeepSeek V4 Flash but only got marginal improvements. There's work to do on fusing kernels and other optimizations that are already on antirez's roadmap so it is not worth duplicating efforts.

I've had positive experiences running GLM 4.7 via vLLM, tool calling works well and the inference is fast. Do you run DeepSeek V4 Flash on vLLM?

DeepSeek v4 Flash MTP is a training optimization. It doesn't make inference run faster, it must run the entire model forward as the "verifier." This is in the paper, and this is why the docs they release do not mention using it for accelerated inference.

Eventually, I'm going to stop writing stuff like this @dang, because even though it is literally being read by a human, it's going to just be copy and pasted into a chatbot, which will actually spend the time trying to comprehend what I am saying.

I'm hoping the decision makers at Qwen notice how influential the 3.6 series is while the 3.7 series has had very little attention at all.

(Of course for all I know the 3.7 series is doing incredibly well in China, but I've seen almost no buzz around it from the circles that I inhabit.)

I think they're just suggesting renting as a way to test that the hardware they're considering purchasing would actually be able to do what they need.

what was the concurrency limitation? that node should be able to support a lot more

Yep, those are the numbers I'm getting with DSv4 Flash on vLLM across 2 sparks.

> I think they're just suggesting renting as a way to test

Well, yes, I understood that.

Which is why I started with the words "You don't even need to go that far.".

To re-phrase what I said in clearer terms:

Instead of renting an instance, then messing around with configuring Linux and whatever via SSH or Ansible or whatever. Just point a Hugging Face link at this magic service and get a ready-to-go API back. Enabling you to test your desired model spec with minimum fuss.

Ultimately the guy wants his own hardware. So why waste time messing around with someone else's VM if you just want to test a specific model spec. That is the TL;DR.

My agent stack with Hermes runs on a workstation. The models run on a DGX Spark on the same LAN. The split is deliberate: the workstation stays responsive, the Spark does the GPU work, and they talk over an HTTP proxy.

Since I started managing the agent fleet through Clawrium, the Hermes count has climbed. More agents on more hosts, more concurrent traffic, all hitting the same Spark. What was a one-laptop, one-model setup is now a small fleet against a single backend — and the shape of the load is exactly what a single-model server can’t serve.

Fleet snapshot with different providers (orchestration using Clawrium)

The Spark served models through ollama for months. It worked. One model up, single config, easy to bring down.

But ollama owns the card. There’s no per-process memory budget, no gpu_memory_utilization knob, no straightforward way to keep a heavy model for reasoning and a fast model for quick turns coresident. KV cache management is whatever the underlying llama.cpp backend gives you. PagedAttention isn’t there.

vLLM fixes all of that.

PagedAttention reclaims KV blocks instead of contiguous-pinning them.
gpu_memory_utilization gives you a per-container budget.

One Spark (GB10, 119.67 GiB unified memory) can run multiple vLLM containers behind a LiteLLM proxy on :4000, and Hermes hits one URL to route to either model. The promise: serve Qwen3-Next-80B-Instruct-FP8 for the heavy work and Qwen3-4B-Instruct-2507 for fast turns, coresident, both reachable from a single endpoint.

That’s the why. What follows is what it took to make the promise hold.

Spark hardware will happily hold two Qwen3 models if the numbers line up. They didn’t, for several days. That’s where my last weekend went.

First 80B config: gpu_memory_utilization: 0.75, max_model_len: 65536, max_num_seqs: 4. vLLM’s KV cache init crashed with “No available memory for the cache blocks.” Qwen3-Next is mostly linear-attention layers (Gated DeltaNet, which vLLM appears to handle as Mamba-style state); the per-block page alignment pushes KV pool demand higher than the ~14 GiB residue after weights.

Bumped to 0.85. Now the free-memory check crashed: “Free memory on device (98.51/119.67 GiB) is less than desired GPU memory utilization (0.85, 101.72 GiB).” The 4B was already resident at ~16 GiB. The 80B’s 0.85 target was reading the whole card, not what was free.

That’s the first lesson. gpu_memory_utilization is a fraction of total GPU memory, not free memory.

Two co-resident vLLM processes need their fractions to sum below ~0.95 to leave room for CUDA framework overhead. If your math assumes free, you’ll oscillate between OOMs and silent KV starvation.
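The arithmetic behind both crash messages can be checked by hand. The following is a minimal sketch, assuming the startup check simply compares utilization × total memory against the free memory at that moment, which is what the log lines above suggest; the function names are illustrative and are not vLLM APIs.

```python
TOTAL_GIB = 119.67  # total unified memory as reported in the vLLM log lines


def requested_gib(utilization: float, total_gib: float = TOTAL_GIB) -> float:
    """Memory a process asks for: a fraction of TOTAL memory, not of free memory."""
    return utilization * total_gib


def passes_startup_check(utilization: float, free_gib: float,
                         total_gib: float = TOTAL_GIB) -> bool:
    """Assumed form of the check: the request must fit in what is free right now."""
    return requested_gib(utilization, total_gib) <= free_gib


def fractions_fit(fractions, ceiling: float = 0.95) -> bool:
    """Rule of thumb from the text: co-resident fractions should sum below ~0.95."""
    return sum(fractions) < ceiling


if __name__ == "__main__":
    # 80B at 0.85 while the 4B was already resident (98.51 GiB free)
    print(round(requested_gib(0.85), 2), passes_startup_check(0.85, 98.51))
    # 4B at 0.12 after the 80B had loaded at 0.85 (12.58 GiB free)
    print(round(requested_gib(0.12), 2), passes_startup_check(0.12, 12.58))
    print(fractions_fit([0.80, 0.10]), fractions_fit([0.85, 0.12]))
```

Both failing cases reproduce the numbers in the error messages (101.72 GiB and 14.36 GiB requested), which is a quick way to confirm that the fraction is being applied to the whole card.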

Settled at 0.80 / 32k / 2 for the 80B. Loaded clean. KV pool ~20.8 GiB after weights.

Then Hermes came online and tool calls came back as plain text. <tool_call> JSON sitting inside content. tool_calls: []. finish_reason: stop. Hermes never executed it.

A day of parser triage produced nothing actionable. Both hermes_tool_parser.py and qwen3xml_tool_parser.py look for <tool_call> (singular). The <tools> plural tag is the system-prompt definition, not the output. The parser wasn’t wrong. The model wasn’t emitting.

tool_choice: "required" worked. tool_choice: "auto" came back empty: tool_calls: [], content: "", 619 characters of reasoning inside <think> concluding “Alright, that’s it” without emitting the call.

Qwen’s own model card states it plainly: Qwen3-Next-80B-Thinking supports only thinking mode. enable_thinking: false is a structural no-op on this checkpoint. /no_think in the prompt is ignored. The model reasons inside <think>, decides, and never emits.

That’s an unrecoverable failure for any agent SDK that defaults to tool_choice: "auto". The fix wasn’t a parser flag. It was swapping the whole 80B backbone from Thinking to Instruct.
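Because this failure raises no error, it helps to classify each response explicitly during triage. Below is a minimal sketch, assuming an OpenAI-compatible chat completion `choice` dictionary with the `message`, `tool_calls`, and `content` fields described above; the category names are made up for illustration.

```python
def classify_tool_turn(choice: dict) -> str:
    """Label one chat-completion choice by what happened to the tool call."""
    message = choice.get("message") or {}
    tool_calls = message.get("tool_calls") or []
    content = message.get("content") or ""

    if tool_calls:
        return "tool_call"            # structured call, the agent can execute it
    if "<tool_call>" in content:
        return "leaked_into_content"  # call printed as text, parser never saw it
    if not content.strip():
        return "empty"                # model reasoned, then emitted nothing
    return "text"                     # ordinary assistant reply


if __name__ == "__main__":
    silent = {"finish_reason": "stop",
              "message": {"content": "", "tool_calls": []}}
    leaked = {"finish_reason": "stop",
              "message": {"content": '<tool_call>{"name": "search"}</tool_call>',
                          "tool_calls": []}}
    print(classify_tool_turn(silent), classify_tool_turn(leaked))
```

Running this over a batch of `tool_choice: "auto"` requests separates the two symptoms seen here: calls leaking into `content` versus the model emitting nothing at all.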

77 GiB pre-pull. Drain GPU. Bring up with --enable-auto-tool-choice --tool-call-parser hermes, no --reasoning-parser. Three LiteLLM aliases (writer / reviewer / sources) all passed tool_choice: "auto" cleanly with finish_reason: tool_calls. Trade accepted: reviewer loses native <think> traces. Reasoning moved into the prompt.
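For reference, the launch flags named above can be assembled as follows. This is a sketch rather than the actual container config: the model string is a placeholder to replace with the real repository ID, the port is an assumption, and the sizing values are the 0.80 / 32k / 2 settings from earlier.

```python
import shlex


def vllm_serve_args(model: str, gpu_memory_utilization: float,
                    max_model_len: int, max_num_seqs: int,
                    tool_parser: str = "hermes", port: int = 8000) -> list:
    """Build a `vllm serve` command line with auto tool choice enabled.

    Deliberately omits --reasoning-parser, matching the Instruct setup above.
    """
    return [
        "vllm", "serve", model,
        "--port", str(port),  # assumed port, not taken from the article
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--enable-auto-tool-choice",
        "--tool-call-parser", tool_parser,
    ]


if __name__ == "__main__":
    # Placeholder model name; substitute the repository ID you actually pulled.
    args = vllm_serve_args("Qwen3-Next-80B-Instruct-FP8", 0.80, 32768, 2)
    print(shlex.join(args))
```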

Reviewer agent (running on Hermes) needed 64k context. Bumped the 80B to 0.85 / 65536 / 2. 80B loaded healthy. The 4B’s restart loop kicked in 19 times: “Free memory on device (12.58/119.67 GiB) is less than desired GPU memory utilization (0.12, 14.36 GiB).”

80B’s actual residency at 0.85 was 101.5 GiB. Plus ~5 GiB CUDA framework overhead. That left ~12.5 GiB free. The 4B needed 14.36 GiB. No room.

Toned the 80B back to 0.80, dropped the 4B to 0.10 / 16384 / 8. Both came up healthy. The 4B’s max_model_len had to drop because the 0.10 allocation leaves only ~3.5 GiB for KV pool: 32k single-seq KV demand (~4.8 GiB) doesn’t fit; 16k (~2.4 GiB) does.
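The context-length choice for the 4B comes down to a simple fit test. Here is a rough sketch, assuming single-sequence KV demand scales linearly with max_model_len (reasonable for a pure attention model) and using the ~4.8 GiB at 32k figure as the reference point; the helper names are illustrative.

```python
REF_LEN = 32768  # reference context length from the text
REF_GIB = 4.8    # approximate single-sequence KV demand at REF_LEN


def kv_demand_gib(max_model_len: int) -> float:
    """Linear estimate of KV demand for one sequence at the given context length."""
    return REF_GIB * max_model_len / REF_LEN


def largest_fitting_len(pool_gib: float, candidates=(32768, 16384, 8192)):
    """Return the largest candidate context length whose KV demand fits the pool."""
    for length in sorted(candidates, reverse=True):
        if kv_demand_gib(length) <= pool_gib:
            return length
    return None  # nothing fits, so the allocation itself has to grow


if __name__ == "__main__":
    pool = 3.5  # approximate KV pool left at gpu_memory_utilization 0.10
    print(kv_demand_gib(32768), kv_demand_gib(16384), largest_fitting_len(pool))
```

The linear assumption does not carry over to Qwen3-Next, where state alignment dominates, so this estimate only applies to the 4B side of the deployment.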

This is the table I wish I’d built on day one:

Three observations from the actuals.

The 80B’s actual residency at 0.80 ran 8 GiB under allocation. That cushion is the only reason the 4B’s restart variability doesn’t break the deployment. At 0.85, the cushion went negative — same hardware, same models, same vLLM build.

The 4B at 0.10 actually resides at 13.8 GiB, not the 12 GiB the target implies. CUDA framework overhead doesn’t disappear at small allocations.

On Qwen3-Next specifically, max_model_len × max_num_seqs is dominated by Mamba state alignment, not attention KV. Halving max_model_len doesn’t halve KV pool demand the way it does on a pure attention model. Plan KV against Mamba page sizes, not against intuition from Llama-class models.
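The target-versus-actual figures quoted in the text can be lined up programmatically. The sketch below uses only numbers stated above; the 80B residency at 0.80 is derived from the "8 GiB under allocation" remark rather than measured, so treat that row as approximate.

```python
TOTAL_GIB = 119.67  # total memory reported in the vLLM logs


def row(label, utilization, actual_gib=None, delta_gib=None):
    """One table row: target from the fraction, actual or delta from the text."""
    target = utilization * TOTAL_GIB
    if actual_gib is None and delta_gib is not None:
        actual_gib = target + delta_gib   # derive actual from a stated delta
    elif actual_gib is not None:
        delta_gib = actual_gib - target   # derive delta from a stated actual
    return {"label": label, "target": target,
            "actual": actual_gib, "delta": delta_gib}


ROWS = [
    row("80B @ 0.80", 0.80, delta_gib=-8.0),   # "8 GiB under allocation"
    row("80B @ 0.85", 0.85, actual_gib=101.5),
    row("4B  @ 0.10", 0.10, actual_gib=13.8),
]


def format_table(rows) -> str:
    """Render the rows as fixed-width plain text."""
    header = f"{'model @ util':<14}{'target GiB':>12}{'actual GiB':>12}{'delta GiB':>11}"
    lines = [header]
    for r in rows:
        lines.append(f"{r['label']:<14}{r['target']:>12.2f}"
                     f"{r['actual']:>12.2f}{r['delta']:>+11.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(ROWS))
```

The sign of the delta column carries the point: the large model can sit well under its target while the small one overshoots its own.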

Once the wiring was complete, LiteLLM showed all the aliases for the same two models running on the Spark.

vLLM evaluates gpu_memory_utilization once, at process start, against total card memory. It is not a target against free memory. CUDA contexts from prior failed attempts can transiently inflate residency and trip the check spuriously. Co-resident processes don’t negotiate; they race.

The only number that matters is actual residency after both processes have stabilized, measured against the headroom the harder-to-restart model needs to come back from a crash. Target allocations are a planning input; actuals are the ground truth.

For a two-model Spark deployment, the playbook is: load the bigger model first, let it settle, run nvidia-smi to read actual residency, then size the smaller model’s gpu_memory_utilization against the free pool minus ~5 GiB for its own framework overhead. Recheck after both restart cleanly twice.
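The sizing step of the playbook can be written down as a small helper. This is a minimal sketch, assuming the ~5 GiB framework overhead figure from above and rounding down to two decimals to stay conservative; the function name is hypothetical.

```python
import math

TOTAL_GIB = 119.67   # total memory reported in the vLLM logs
OVERHEAD_GIB = 5.0   # approximate per-process framework overhead from the text


def max_small_utilization(free_gib: float, total_gib: float = TOTAL_GIB,
                          overhead_gib: float = OVERHEAD_GIB) -> float:
    """Largest gpu_memory_utilization for the second model, rounded down.

    free_gib is the free memory measured AFTER the bigger model has settled.
    """
    usable = free_gib - overhead_gib
    if usable <= 0:
        return 0.0
    return math.floor(usable / total_gib * 100) / 100


if __name__ == "__main__":
    # With the 80B at 0.85, the log showed 12.58 GiB free on the device.
    print(max_small_utilization(12.58))
```

With only 12.58 GiB free, the helper returns a fraction well below the 0.12 the 4B was asking for, which matches the restart loop described earlier.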

If you have a vLLM deployment running right now, pull this:

nvidia-smi --query-gpu=memory.used --format=csv

Compare the actual number to what your gpu_memory_utilization target implies. If the two diverge by more than 10%, your sizing model is wrong. Fix it before you ship anything that depends on coresidency — agent stacks, parallel workers, fallback chains. The math has to be empirical, not aspirational.
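The comparison can be scripted. Below is a sketch that parses the CSV output of the command above and applies the 10% rule; it assumes the usual `<number> MiB` row format, skips any non-numeric rows, and the helper names are illustrative. Note that memory.used covers the whole GPU, so with several processes the target to compare against is the sum of their allocations.

```python
import subprocess

TOTAL_GIB = 119.67  # total memory reported in the vLLM logs


def parse_used_mib(csv_text: str) -> list:
    """Extract memory.used values (MiB) from nvidia-smi CSV output, one per GPU."""
    values = []
    for line in csv_text.splitlines()[1:]:  # first line is the CSV header
        fields = line.split()
        if not fields:
            continue
        try:
            values.append(float(fields[0]))
        except ValueError:
            continue  # non-numeric row (for example an unsupported field)
    return values


def divergence(actual_gib: float, utilization_sum: float,
               total_gib: float = TOTAL_GIB) -> float:
    """Relative gap between measured residency and what the targets imply."""
    target = utilization_sum * total_gib
    return abs(actual_gib - target) / target


def read_used_mib() -> list:
    """Run nvidia-smi and parse its output (requires the NVIDIA driver tools)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)


if __name__ == "__main__":
    sample = "memory.used [MiB]\n14131 MiB\n"  # roughly 13.8 GiB
    used_gib = parse_used_mib(sample)[0] / 1024
    print(divergence(used_gib, 0.10) > 0.10)
```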

If you’re standing up a similar local-LLM stack (DGX Spark or other hardware, vLLM, multiple coresident models, or wiring a remote agent fleet to a single inference backend), I’d love to compare notes.
