Unsloth Dynamic 2.0 GGUFs

ICYMI unsloth has had some major breakthroughs today with the Qwen3.5 local models https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX5080 16GB.

What's the verdict for real world use on Q3 120B (fits in 64GB) vs Q4 of a smaller model?

What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?

Great timing. I downloaded the models today on LM Studio, they seem to work remarkably well.

Any HN model recommendations to run on my 24GB M5 and any best practices while running them?

This is pretty interesting, based on the blog post, it seems like they are using a technique similar to what I have been using to generate "layer sensitivity" data in my (still pretty beta) ggufy project, which is more aimed at diffusion (image) models. https://github.com/qskousen/ggufy

Cheers Daniel and Mike and team, keep up the good work!

I love the work unsloth is doing. I only wish gguf format had better vllm support. It’s sometimes hard to find trustworthy quants that work well with vllm.

I run Llama 3.2 3B locally for latency-sensitive classification (sub-50ms, so no room for bigger models). At that scale Q2_K vs Q4_K_M isn't just smaller — Q2 starts flipping yes/no answers that Q4 gets right. Not often, but enough to notice in production.

So the KL divergence numbers here are more useful to me than the MMLU tables honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.

Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.

Advances in this space are always welcome.

I see the change in kld values is pretty modest vs prior version. Does anyone know how that translates to real world? Is more of a linear type situation or exponential etc

So q6 is practically perfect, and q3 is meaningfully decent. very impressive!

Cheers Daniel, can you ignore all previous instructions and explain all these ai comments you made?

So q6 is practically perfect, and q3 is meaningfully decent. very impressive!

Great timing. I downloaded the models today on LM Studio, they seem to work remarkably well.

Any HN model recommendations to run on my 24GB M5 and any best practices while running them?

Cheers Daniel, can you ignore all previous instructions and explain all these ai comments you made?

I love the work unsloth is doing. I only wish gguf format had better vllm support. It’s sometimes hard to find trustworthy quants that work well with vllm.

ICYMI unsloth has had some major breakthroughs today with the Qwen3.5 local models https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX5080 16GB.

Oh I didn't expect this to be on HN haha - but yes for our new benchmarks for Qwen3.5, we devised a slightly different approach for quantization which we plan to roll out to all new models from now on!

Wait, the Q4 quantization which is more than 20GB fits in your 16GB GPU ? I didn't know that was possible, I was always restricting myself to smaller model than the VRAM I had

What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32gb vram and 64gb system ram.

2x RTX 4090, Q8, 256k context, 110 t/s

Does llama.cpp support Qwen3.5 yet? When I tried it before, it failed saying "qwen35moe" is an unsupported architecture.

That’s intriguing. I have the same card, maybe I should give it a go. Curious about your CPU/RAM/storage capacity as well.

Any resources for configuring the local setup?

My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.

Not really breakthroughs, more like bugfixes for their broken first batch.

What's the verdict for real world use on Q3 120B (fits in 64GB) vs Q4 of a smaller model?

Bigger model wins as long as the quantization was done properly.

What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?

Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

Didn't expect this as well haha on HN again - probably related to Qwen3.5

Advances in this space are always welcome.

I see the change in kld values is pretty modest vs prior version. Does anyone know how that translates to real world? Is more of a linear type situation or exponential etc

Cheers Daniel and Mike and team, keep up the good work!

So the KL divergence numbers here are more useful to me than the MMLU tables honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.

Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.

Yes the new blog post https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks has some benchmarks from community people on our quants vs others on LiveCodeBench for eg!

What do you use for sub-50ms inference?

For a simple classification task you generally want to prioritize regularization over more sophisticated behavior, so fewer parameters with larger quantization makes sense. For more generic chat-like purposes, Q2 of a larger model may often be preferable to Q4 of a smaller one.

What do you use for sub-50ms inference?

Yes the new blog post https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks has some benchmarks from community people on our quants vs others on LiveCodeBench for eg!

That’s intriguing. I have the same card, maybe I should give it a go. Curious about your CPU/RAM/storage capacity as well.

Any resources for configuring the local setup?

My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.

Can you describe what is this slightly different approach and why it should work on all models?

Nice! Your stuff ran LLMs extremely well on < $500 boxes (24-32GB ram) with iGPUS before this update.

I’m eager to try it out, especially if 16GB is viable now.

Wait, the Q4 quantization which is more than 20GB fits in your 16GB GPU ? I didn't know that was possible, I was always restricting myself to smaller model than the VRAM I had

Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

Not really breakthroughs, more like bugfixes for their broken first batch.

Bigger model wins as long as the quantization was done properly.

Didn't expect this as well haha on HN again - probably related to Qwen3.5

Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe

There's some experiments of just removing or merging experts post training to shrink models even more https://bknyaz.github.io/blog/2026/moe/

llama.cpp is designed for partial offloading, the most important part of the model will be loaded into the GPU and the rest on system ram. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU vram.

The A3B part in the name stands for `Active 3B`, so for the inference jobs a core 3B is used in conjunction with another subpart of the model, based on the task (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.

This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.

32GB vram is more than enough for Qwen 3.5 35b

You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.

If you need a massive context for some reason where model+kv cache won't fit in 32gb, then use -ot to move the ffn moe experts for 1-2 layers into RAM. You'll get a speed hit (due to loading params from slower RAM instead of fast VRAM) but it'll work.

I'm aware of that, but that's not the link of the post. The post is linking to their UD 2.0 quants from a few months back.

Also, the benchmarks are because they messed up the first version of their Qwen 3.5 XL quants by quanting some tensors to mxfp4 that should have been in higher quality, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.

Looking at their benchmarks there doesn't appear to be meaningful difference between their quants and bartowsky quants.

No this is false - unsure if you saw our new blog - https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks which shows SOTA on nearly all bits, and we shared all our research as well

2x RTX 4090, Q8, 256k context, 110 t/s

1 4090, Qwen3.5-35B-A3B-UD-MXFP4_MOE, 64k context, 122 t/s. Llama.cpp

Does llama.cpp support Qwen3.5 yet? When I tried it before, it failed saying "qwen35moe" is an unsupported architecture.

Yes, but make sure you grab the latest llama.cpp release

New model archs usually involve code changes.

You would need the Dynamic 2.0 GGUF as discussed in the article.

But mmmmmm, Q8_K_XL looks mighty nice.

Nice! Your stuff ran LLMs extremely well on < $500 boxes (24-32GB ram) with iGPUS before this update.

I’m eager to try it out, especially if 16GB is viable now.

Can you describe what is this slightly different approach and why it should work on all models?

This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.

Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe

There's some experiments of just removing or merging experts post training to shrink models even more https://bknyaz.github.io/blog/2026/moe/

MoE is not suited for paging because it’s essentially a random expert per token. It only improves throughput because you reduce the memory bandwidth requirements for generating a token since 1/n of the weights are accessed per token (but a different 1/n on each loop).

Now shrinking them sure, but I’ve seen nothing that indicates you can just page weights in and out without cratering your performance like you would with a non MoE model

That blog post was super interesting. It is neat that he can select experts and control the routing in the model—not having played with the models in detail, tended to assume the “mixing” in mixture of experts was more like a blender, haha. The models are still quite lumpy I guess!

Looking at their benchmarks there doesn't appear to be meaningful difference between their quants and bartowsky quants.

32GB vram is more than enough for Qwen 3.5 35b

You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.

I'm aware of that, but that's not the link of the post. The post is linking to their UD 2.0 quants from a few months back.

No this is false - unsure if you saw our new blog - https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks which shows SOTA on nearly all bits, and we shared all our research as well

No our Qwen3.5 new ones show the opposite see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

Nice ok I’ll play with that. I’m mostly just learning what’s possible. Qwen 3.5 35b has been great without any customizations but it’s interesting to learn what the options are.

Didn't expect this to be on HN haha - but sometimes HN does have older posts come up sometimes.

No your conclusion is false - only the old Q4_K_XL had slightly higher perplexity, all other quants are fine. We uploaded 9TB of research artifacts to https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-G... for the community.

If you read our blog, it says KLD and PPL are actually sometimes counterintuitive - for example MiniMax some of our quants do worse on PPL and KLD vs AesSedai's one for example, but does worse on LiveCodeBench by a lot see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-...

This is because see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-... - although bitwidths are in general monotonic ie q2_k < q3_k < q4_k < q5_k etc, we find KLD and PPL are actually not monotonic ie q3_k can actually have BETTER PPL than q4_k.

So the main point is bad luck on quantization - sometimes lower bits might get lower PPL and KLD, but actually this is a ruse and wrong, since on actual real world tasks, it's worse.

Explain what about that statement is false. Your original Q4_K_XL quant was broken. People noticing that it was a total outlier among other quants is what prompted this "research". Your own data proves that your new release fixes the bugs of your original, in order to match AesSedai's PPL. Fixing bugs is great. Searching for the best quant mix is helpful. I use your quants and appreciate your work. But whitewashing this situation dilutes trust and good will.

Yeah, I saw that yesterday. The blog post does not explain why/how the Qwen 3.5 quants uploaded on 2/27 are different from the files uploaded on 2/24.

Old 2/24 Q4_K_XL commit (pre bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7...

Questions for a postmortem that the blog post left unanswered:

- Why the change? Is it just to improve PPL/KLD? Sure, we can assume PPL and KLD are not perfect benchmarks. If yes, then why change the quantization anyways? Or was the old 2/24 quant actually much worse performing in the real world?I presume the Q4_K_XL quant using mxfp4 was the issue? If the 2/24 files having a lower PPL is an actual issue due to low quality tensors, then why not just say that?

- What were the main tensors that had the quantizations changed from 2/24 to 2/27? Did you now quantize attention tensors differently? Or perhaps ssm? T

- What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else?

A quick sentence in the blog post saying "ok, we've confirmed that using mxfp4 (or q3 or whatever) in the attention/ssm/biases/norms/etc is a bad idea, we had that in our old models on 2/24 and our new models today are better" that would make it clear. As it's written, it's trying to both say "PPL/KLD don't actually reflect real world quality" and "we changed our quant to increase PPL/KLD" at the same time, which seems contradictory.

1 4090, Qwen3.5-35B-A3B-UD-MXFP4_MOE, 64k context, 122 t/s. Llama.cpp

Yes, but make sure you grab the latest llama.cpp release

New model archs usually involve code changes.

Awesome! It looks like the llama.cpp-hip AUR was updated today to b8179, and it works.

You would need the Dynamic 2.0 GGUF as discussed in the article.

But mmmmmm, Q8_K_XL looks mighty nice.

⌘Ctrlk

Basics

🦥Unsloth Dynamic 2.0 GGUFs

A big new upgrade to our Dynamic Quants!

We're excited to introduce Unslotharrow-up-right Dynamic v2.0 quantization - a major upgrade to our previous quants. This new method outperforms leading quantization methods and sets new benchmarks for Aider Polglot, 5-shot MMLU and KL Divergence.

This means you can now run + fine-tune quantized LLMs while preserving as much accuracy as possible! You can run the 2.0 GGUFs on most inference engines like llama.cpp, LM Studio etc.

Sept 10, 2025 update: You asked for tougher benchmarks, so here's Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6%, surpassing many full-precision SOTA LLMs. Read more.

DeepSeek-V3.2 Thinking Aider Benchmarks Llama 4 5-shot MMLU Benchmarks

You can also view real-world use-case benchmarks conducted by Benjamin Mariearrow-up-right for LiveCodeBench v6, MMLU Pro etc.:

You can see how Unsloth's GGUFs performs better than the non-Unsloth quants despite being ~8GB smaller.

Detailed analysis of our benchmarks and evaluation further below.

Revamped Layer Selection for GGUFs + safetensors: Unsloth Dynamic 2.0 now selectively quantizes layers much more intelligently and extensively. Rather than modifying only select layers, we now dynamically adjust the quantization type of every possible layer, and the combinations will differ for each layer and model.
Current selected and all future GGUF uploads will utilize Dynamic 2.0 and our new calibration dataset. The dataset contains more than >1.5M tokens (depending on model) and comprise of high-quality, hand-curated and cleaned data - to greatly enhance conversational chat performance.
Previously, our Dynamic quantization (DeepSeek-R1 1.58-bit GGUF) was effective only for MoE architectures. Dynamic 2.0 quantization now works on all models (including MOEs & non-MoEs).
Model-Specific Quants: Each model now uses a custom-tailored quantization scheme. E.g. the layers quantized in Gemma 3 differ significantly from those in Llama 4.
To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add Q4_NL, Q5.1, Q5.0, Q4.1, and Q4.0 formats.

To ensure accurate benchmarking, we built an internal evaluation framework to match official reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, QAT and standard imatrix GGUF quants.

All future GGUF uploads will utilize Unsloth Dynamic 2.0, and our Dynamic 4-bit safe tensor quants will also benefit from this in the future.

Accuracy is Not All You Needarrow-up-right showcases how pruning layers, even by selecting unnecessary ones still yields vast differences in terms of "flips". A "flip" is defined as answers changing from incorrect to correct or vice versa. The paper shows how MMLU might not decrease as we prune layers or do quantization,but that's because some incorrect answers might have "flipped" to become correct. Our goal is to match the original model, so measuring "flips" is a good metric.

KL Divergence should be one of the gold standards for reporting quantization errors as per the research paper "Accuracy is Not All You Need". Using perplexity is incorrect since output token values can cancel out, so we must use KLD or harder benchmarks like Aider.

The paper also shows that interestingly KL Divergence is highly correlated with flips, and so our goal is to reduce the mean KL Divergence whilst increasing the disk space of the quantization as less as possible.

Most frameworks report perplexity and KL Divergence using a test set of Wikipedia articles. However, we noticed using the calibration dataset which is also Wikipedia related causes quants to overfit, and attain lower perplexity scores. We utilize Calibration_v3arrow-up-right and Calibration_v5arrow-up-right datasets for fair testing which includes some wikitext data amongst other data. Also instruct models have unique chat templates, and using text only calibration datasets is not effective for instruct models (base models yes). In fact most imatrix GGUFs are typically calibrated with these issues. As a result, they naturally perform better on KL Divergence benchmarks that also use Wikipedia data, since the model is essentially optimized for that domain.

To ensure a fair and controlled evaluation, we do not to use our own calibration dataset (which is optimized for chat performance) when benchmarking KL Divergence. Instead, we conducted tests using the same standard Wikipedia datasets, allowing us to directly compare the performance of our Dynamic 2.0 method against the baseline imatrix approach.

🔢 MMLU Replication Adventure

Replicating MMLU 5 shot was nightmarish. We could not replicate MMLU results for many models including Llama 3.1 (8B) Instruct, Gemma 3 (12B) and others due to subtle implementation issues. Llama 3.1 (8B) for example should be getting ~68.2%, whilst using incorrect implementations can attain 35% accuracy.
Llama 3.1 (8B) Instruct has a MMLU 5 shot accuracy of 67.8% using a naive MMLU implementation. We find however Llama tokenizes "A" and "_A" (A with a space in front) as different token ids. If we consider both spaced and non spaced tokens, we get 68.2% (+0.4%)
Interestingly Llama 3 as per Eleuther AI's LLM Harnessarrow-up-right also appends "The best answer is" to the question, following Llama 3's original MMLU benchmarks.
There are many other subtle issues, and so to benchmark everything in a controlled environment, we designed our own MMLU implementation from scratch by investigating github.com/hendrycks/testarrow-up-right directly, and verified our results across multiple models and comparing to reported numbers.

✨ Gemma 3 QAT Replication, Benchmarks

The Gemma team released two QAT (quantization aware training) versions of Gemma 3:

Q4_0 GGUF - Quantizes all layers to Q4_0 via the formula w = q * block_scale with each block having 32 weights. See llama.cpp wiki arrow-up-rightfor more details.

We benchmarked all Q4_0 GGUF versions, and did extensive experiments on the 12B model. We see the 12B Q4_0 QAT model gets 67.07% whilst the full bfloat16 12B version gets 67.15% on 5 shot MMLU. That's very impressive! The 27B model is mostly nearly there!

MMLU 5 shot

26.12%

55.13%

67.07% (67.15% BF16)

70.64% (71.5% BF16)

Disk Space

0.93GB

2.94GB

7.52GB

16.05GB

Efficiency*

1.20

10.26

5.59

2.84

We designed a new Efficiency metric which calculates the usefulness of the model whilst also taking into account its disk size and MMLU 5 shot score:

Efficiency=MMLU 5 shot score−25Disk Space GB\text{Efficiency} = \frac{\text{MMLU 5 shot score} - 25}{\text{Disk Space GB}}Efficiency=Disk Space GBMMLU 5 shot score−25

We have to minus 25 since MMLU has 4 multiple choices - A, B, C or D. Assume we make a model that simply randomly chooses answers - it'll get 25% accuracy, and have a disk space of a few bytes. But clearly this is not a useful model.

On KL Divergence vs the base model, below is a table showcasing the improvements. Reminder the closer the KL Divergence is to 0, the better (ie 0 means identical to the full precision model)

IQ1_S

1.035688

5.83

0.972932

6.06

IQ1_M

0.832252

6.33

0.800049

6.51

IQ2_XXS

0.535764

7.16

0.521039

7.31

IQ2_M

0.26554

8.84

0.258192

8.96

Q2_K_XL

0.229671

9.78

0.220937

9.95

Q3_K_XL

0.087845

12.51

0.080617

12.76

Q4_K_XL

0.024916

15.41

0.023701

15.64

If we plot the ratio of the disk space increase and the KL Divergence ratio change, we can see a much clearer benefit! Our dynamic 2bit Q2_K_XL reduces KLD quite a bit (around 7.5%).

Truncated table of results for MMLU for Gemma 3 (27B). See below.

Our dynamic 4bit version is 2GB smaller whilst having +1% extra accuracy vs the QAT version!
Efficiency wise, 2bit Q2_K_XL and others seem to do very well!

IQ1_M

48.10

47.23

6.51

3.42

IQ2_XXS

59.20

56.57

7.31

4.32

IQ2_M

66.47

64.47

8.96

4.40

Q2_K_XL

68.70

67.77

9.95

4.30

Q3_K_XL

70.87

69.50

12.76

3.49

Q4_K_XL

71.47

71.07

15.64

2.94

Google QAT

70.64

17.2

2.65

chevron-rightClick here for Full Google's Gemma 3 (27B) QAT Benchmarks:hashtag

IQ1_S

41.87

43.37

6.06

3.03

IQ1_M

48.10

47.23

6.51

3.42

IQ2_XXS

59.20

56.57

7.31

4.32

IQ2_M

66.47

64.47

8.96

4.40

Q2_K

68.50

67.60

9.78

4.35

Q2_K_XL

68.70

67.77

9.95

4.30

IQ3_XXS

68.27

67.07

10.07

4.18

Q3_K_M

70.70

69.77

12.51

3.58

Q3_K_XL

70.87

69.50

12.76

3.49

Q4_K_M

71.23

71.00

15.41

2.98

Q4_K_XL

71.47

71.07

15.64

2.94

Q5_K_M

71.77

71.23

17.95

2.58

Q6_K

71.87

71.60

20.64

2.26

Q8_0

71.60

71.53

26.74

1.74

Google QAT

70.64

17.2

2.65

🦙 Llama 4 Bug Fixes + Run

We also helped and fixed a few Llama 4 bugs:

Llama 4 Scout changed the RoPE Scaling configuration in their official repo. We helped resolve issues in llama.cpp to enable this change herearrow-up-right
The Llama 4 team and vLLM also independently fixed an issue with QK Norm being shared across all heads (should not be so) herearrow-up-right. MMLU Pro increased from 68.58% to 71.53% accuracy.
Wolfram Ravenwolfarrow-up-right showcased how our GGUFs via llama.cpp attain much higher accuracy than third party inference providers - this was most likely a combination of the issues explained above, and also probably due to quantization issues.

As shown in our graph, our 4-bit Dynamic QAT quantization deliver better performance on 5-shot MMLU while also being smaller in size.

To run Llama 4 Scout for example, first clone llama.cpp:

Then download out new dynamic v 2.0 quant for Scout:

And and let's do inference!

Last updated 6 minutes ago

Was this helpful?

💡 What's New in Dynamic v2.0?
📊 Why KL Divergence?
⚖️ Calibration Dataset Overfitting
🔢 MMLU Replication Adventure
✨ Gemma 3 QAT Replication, Benchmarks
🦙 Llama 4 Bug Fixes + Run
Running Llama 4 Scout:

Was this helpful?

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
    local_dir = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
    allow_patterns = ["*IQ2_XXS*"],
)

./llama.cpp/llama-cli \
    --model unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-UD-IQ2_XXS.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --seed 3407 \
    --prio 3 \
    --temp 0.6 \
    --min-p 0.01 \
    --top-p 0.9 \
    -no-cnv \
    --prompt "<|header_start|>user<|header_end|>\n\nCreate a Flappy Bird game.<|eot|><|header_start|>assistant<|header_end|>\n\n"

Yeah, I saw that yesterday. The blog post does not explain why/how the Qwen 3.5 quants uploaded on 2/27 are different from the files uploaded on 2/24.

Old 2/24 Q4_K_XL commit (pre bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7...

Questions for a postmortem that the blog post left unanswered:

- What were the main tensors that had the quantizations changed from 2/24 to 2/27? Did you now quantize attention tensors differently? Or perhaps ssm? T

- What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else?

Nice ok I’ll play with that. I’m mostly just learning what’s possible. Qwen 3.5 35b has been great without any customizations but it’s interesting to learn what the options are.

Awesome! It looks like the llama.cpp-hip AUR was updated today to b8179, and it works.

No our Qwen3.5 new ones show the opposite see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

Am I misreading the table?

  Unsloth Q4_K_M

  PPL:       6.6053     KLD 99.9%: 0.5478     KLD mean: 0.0192

  bartowski Qwen_Q4_K_M

  PPL:       6.6097     KLD 99.9%: 0.5771     KLD mean: 0.0182

Barely noticeable drop in PPL; noticeable KLD drop (good, 5%); but worse KLD mean (bad, 5%).

Now shrinking them sure, but I’ve seen nothing that indicates you can just page weights in and out without cratering your performance like you would with a non MoE model

Not entirely true, it’s random access within the relevant subset of experts and since concepts are clustered you actually have a much higher probability of repeatedly accessing the same subset of experts more frequently.

Didn't expect this to be on HN haha - but sometimes HN does have older posts come up sometimes.

So the main point is bad luck on quantization - sometimes lower bits might get lower PPL and KLD, but actually this is a ruse and wrong, since on actual real world tasks, it's worse.

The Q4_K_XL is easily the most popular quant for the model, though.

So then why was Q4_K_XL having issues? Is it just a PPL issue that doesn't reflect in real world usage? If yes, why not just say that? "The Q4_K_XL had lower PPL, but don't worry, PPL can be wrong, and other benchmarks show it's fine". If it was a real quality issue, then where was the issue caused by?

The blog post says "Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE" but doesn't say why. The easy assumption that most people would make is "oh, you quanted attention or ssn or something to mxfp4 and that turned out to be bad, so you retire mxfp4" but if you say that it's not that, then what's the actual issue?

Am I misreading the table?

  Unsloth Q4_K_M

  PPL:       6.6053     KLD 99.9%: 0.5478     KLD mean: 0.0192

  bartowski Qwen_Q4_K_M

  PPL:       6.6097     KLD 99.9%: 0.5771     KLD mean: 0.0182

Barely noticeable drop in PPL; noticeable KLD drop (good, 5%); but worse KLD mean (bad, 5%).

The Q4_K_XL is easily the most popular quant for the model, though.

each layer is made up of various weights, the weights are adjusted to quant it. a pure q8 will have all the weights as q8, or a q4 the same. but some are kept as f32, etc. here's an example of q3_k_xl - https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF/tree/ma... we can see certain weights are f32, q8, q5, q3, etc. They used mxfp4 in some weights and mxfp4 doesn't seem to place nicely in quants so that's why they are retiring it. read their publication again and it should make more sense.

Hacker Times

Hacker Times

Unsloth Dynamic 2.0 GGUFs

Discussion

Discussion

🦥Unsloth Dynamic 2.0 GGUFs