Any HN model recommendations to run on my 24GB M5 and any best practices while running them?
With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX5080 16GB.
I see the change in kld values is pretty modest vs prior version. Does anyone know how that translates to real world? Is more of a linear type situation or exponential etc
So the KL divergence numbers here are more useful to me than the MMLU tables honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.
Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.
Any resources for configuring the local setup?
My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.
I’m eager to try it out, especially if 16GB is viable now.
There's some experiments of just removing or merging experts post training to shrink models even more https://bknyaz.github.io/blog/2026/moe/
You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.
If you need a massive context for some reason where model+kv cache won't fit in 32gb, then use -ot to move the ffn moe experts for 1-2 layers into RAM. You'll get a speed hit (due to loading params from slower RAM instead of fast VRAM) but it'll work.
Also, the benchmarks are because they messed up the first version of their Qwen 3.5 XL quants by quanting some tensors to mxfp4 that should have been in higher quality, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.
New model archs usually involve code changes.
But mmmmmm, Q8_K_XL looks mighty nice.
⌘Ctrlk
A big new upgrade to our Dynamic Quants!
We're excited to introduce Unslotharrow-up-right Dynamic v2.0 quantization - a major upgrade to our previous quants. This new method outperforms leading quantization methods and sets new benchmarks for Aider Polglot, 5-shot MMLU and KL Divergence.
This means you can now run + fine-tune quantized LLMs while preserving as much accuracy as possible! You can run the 2.0 GGUFs on most inference engines like llama.cpp, LM Studio etc.
Sept 10, 2025 update: You asked for tougher benchmarks, so here's Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6%, surpassing many full-precision SOTA LLMs. Read more.


You can also view real-world use-case benchmarks conducted by Benjamin Mariearrow-up-right for LiveCodeBench v6, MMLU Pro etc.:
You can see how Unsloth's GGUFs performs better than the non-Unsloth quants despite being ~8GB smaller.
Detailed analysis of our benchmarks and evaluation further below.
Revamped Layer Selection for GGUFs + safetensors: Unsloth Dynamic 2.0 now selectively quantizes layers much more intelligently and extensively. Rather than modifying only select layers, we now dynamically adjust the quantization type of every possible layer, and the combinations will differ for each layer and model.
Current selected and all future GGUF uploads will utilize Dynamic 2.0 and our new calibration dataset. The dataset contains more than >1.5M tokens (depending on model) and comprise of high-quality, hand-curated and cleaned data - to greatly enhance conversational chat performance.
Previously, our Dynamic quantization (DeepSeek-R1 1.58-bit GGUF) was effective only for MoE architectures. Dynamic 2.0 quantization now works on all models (including MOEs & non-MoEs).
Model-Specific Quants: Each model now uses a custom-tailored quantization scheme. E.g. the layers quantized in Gemma 3 differ significantly from those in Llama 4.
To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add Q4_NL, Q5.1, Q5.0, Q4.1, and Q4.0 formats.
To ensure accurate benchmarking, we built an internal evaluation framework to match official reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, QAT and standard imatrix GGUF quants.
All future GGUF uploads will utilize Unsloth Dynamic 2.0, and our Dynamic 4-bit safe tensor quants will also benefit from this in the future.
Accuracy is Not All You Needarrow-up-right showcases how pruning layers, even by selecting unnecessary ones still yields vast differences in terms of "flips". A "flip" is defined as answers changing from incorrect to correct or vice versa. The paper shows how MMLU might not decrease as we prune layers or do quantization,but that's because some incorrect answers might have "flipped" to become correct. Our goal is to match the original model, so measuring "flips" is a good metric.
KL Divergence should be one of the gold standards for reporting quantization errors as per the research paper "Accuracy is Not All You Need". Using perplexity is incorrect since output token values can cancel out, so we must use KLD or harder benchmarks like Aider.
The paper also shows that interestingly KL Divergence is highly correlated with flips, and so our goal is to reduce the mean KL Divergence whilst increasing the disk space of the quantization as less as possible.
Most frameworks report perplexity and KL Divergence using a test set of Wikipedia articles. However, we noticed using the calibration dataset which is also Wikipedia related causes quants to overfit, and attain lower perplexity scores. We utilize Calibration_v3arrow-up-right and Calibration_v5arrow-up-right datasets for fair testing which includes some wikitext data amongst other data. Also instruct models have unique chat templates, and using text only calibration datasets is not effective for instruct models (base models yes). In fact most imatrix GGUFs are typically calibrated with these issues. As a result, they naturally perform better on KL Divergence benchmarks that also use Wikipedia data, since the model is essentially optimized for that domain.
To ensure a fair and controlled evaluation, we do not to use our own calibration dataset (which is optimized for chat performance) when benchmarking KL Divergence. Instead, we conducted tests using the same standard Wikipedia datasets, allowing us to directly compare the performance of our Dynamic 2.0 method against the baseline imatrix approach.
🔢 MMLU Replication Adventure
Replicating MMLU 5 shot was nightmarish. We could not replicate MMLU results for many models including Llama 3.1 (8B) Instruct, Gemma 3 (12B) and others due to subtle implementation issues. Llama 3.1 (8B) for example should be getting ~68.2%, whilst using incorrect implementations can attain 35% accuracy.
Llama 3.1 (8B) Instruct has a MMLU 5 shot accuracy of 67.8% using a naive MMLU implementation. We find however Llama tokenizes "A" and "_A" (A with a space in front) as different token ids. If we consider both spaced and non spaced tokens, we get 68.2% (+0.4%)
Interestingly Llama 3 as per Eleuther AI's LLM Harnessarrow-up-right also appends "The best answer is" to the question, following Llama 3's original MMLU benchmarks.
There are many other subtle issues, and so to benchmark everything in a controlled environment, we designed our own MMLU implementation from scratch by investigating github.com/hendrycks/testarrow-up-right directly, and verified our results across multiple models and comparing to reported numbers.
✨ Gemma 3 QAT Replication, Benchmarks
The Gemma team released two QAT (quantization aware training) versions of Gemma 3:
w = q * block_scale with each block having 32 weights. See llama.cpp wiki arrow-up-rightfor more details.We benchmarked all Q4_0 GGUF versions, and did extensive experiments on the 12B model. We see the 12B Q4_0 QAT model gets 67.07% whilst the full bfloat16 12B version gets 67.15% on 5 shot MMLU. That's very impressive! The 27B model is mostly nearly there!
MMLU 5 shot
26.12%
55.13%
67.07% (67.15% BF16)
70.64% (71.5% BF16)
Disk Space
0.93GB
2.94GB
7.52GB
16.05GB
Efficiency*
1.20
10.26
5.59
2.84
We designed a new Efficiency metric which calculates the usefulness of the model whilst also taking into account its disk size and MMLU 5 shot score:
Efficiency=MMLU 5 shot score−25Disk Space GB\text{Efficiency} = \frac{\text{MMLU 5 shot score} - 25}{\text{Disk Space GB}}Efficiency=Disk Space GBMMLU 5 shot score−25
We have to minus 25 since MMLU has 4 multiple choices - A, B, C or D. Assume we make a model that simply randomly chooses answers - it'll get 25% accuracy, and have a disk space of a few bytes. But clearly this is not a useful model.
On KL Divergence vs the base model, below is a table showcasing the improvements. Reminder the closer the KL Divergence is to 0, the better (ie 0 means identical to the full precision model)
IQ1_S
1.035688
5.83
0.972932
6.06
IQ1_M
0.832252
6.33
0.800049
6.51
IQ2_XXS
0.535764
7.16
0.521039
7.31
IQ2_M
0.26554
8.84
0.258192
8.96
Q2_K_XL
0.229671
9.78
0.220937
9.95
Q3_K_XL
0.087845
12.51
0.080617
12.76
Q4_K_XL
0.024916
15.41
0.023701
15.64
If we plot the ratio of the disk space increase and the KL Divergence ratio change, we can see a much clearer benefit! Our dynamic 2bit Q2_K_XL reduces KLD quite a bit (around 7.5%).
Truncated table of results for MMLU for Gemma 3 (27B). See below.
Our dynamic 4bit version is 2GB smaller whilst having +1% extra accuracy vs the QAT version!
Efficiency wise, 2bit Q2_K_XL and others seem to do very well!
IQ1_M
48.10
47.23
6.51
3.42
IQ2_XXS
59.20
56.57
7.31
4.32
IQ2_M
66.47
64.47
8.96
4.40
Q2_K_XL
68.70
67.77
9.95
4.30
Q3_K_XL
70.87
69.50
12.76
3.49
Q4_K_XL
71.47
71.07
15.64
2.94
Google QAT
70.64
17.2
2.65
chevron-rightClick here for Full Google's Gemma 3 (27B) QAT Benchmarks:hashtag
IQ1_S
41.87
43.37
6.06
3.03
IQ1_M
48.10
47.23
6.51
3.42
IQ2_XXS
59.20
56.57
7.31
4.32
IQ2_M
66.47
64.47
8.96
4.40
Q2_K
68.50
67.60
9.78
4.35
Q2_K_XL
68.70
67.77
9.95
4.30
IQ3_XXS
68.27
67.07
10.07
4.18
Q3_K_M
70.70
69.77
12.51
3.58
Q3_K_XL
70.87
69.50
12.76
3.49
Q4_K_M
71.23
71.00
15.41
2.98
Q4_K_XL
71.47
71.07
15.64
2.94
Q5_K_M
71.77
71.23
17.95
2.58
Q6_K
71.87
71.60
20.64
2.26
Q8_0
71.60
71.53
26.74
1.74
Google QAT
70.64
17.2
2.65
🦙 Llama 4 Bug Fixes + Run
We also helped and fixed a few Llama 4 bugs:
Llama 4 Scout changed the RoPE Scaling configuration in their official repo. We helped resolve issues in llama.cpp to enable this change herearrow-up-right
The Llama 4 team and vLLM also independently fixed an issue with QK Norm being shared across all heads (should not be so) herearrow-up-right. MMLU Pro increased from 68.58% to 71.53% accuracy.
Wolfram Ravenwolfarrow-up-right showcased how our GGUFs via llama.cpp attain much higher accuracy than third party inference providers - this was most likely a combination of the issues explained above, and also probably due to quantization issues.
As shown in our graph, our 4-bit Dynamic QAT quantization deliver better performance on 5-shot MMLU while also being smaller in size.
To run Llama 4 Scout for example, first clone llama.cpp:
Then download out new dynamic v 2.0 quant for Scout:
And and let's do inference!
Last updated 6 minutes ago
Was this helpful?
Was this helpful?
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
local_dir = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
allow_patterns = ["*IQ2_XXS*"],
)
./llama.cpp/llama-cli \
--model unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-UD-IQ2_XXS.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.6 \
--min-p 0.01 \
--top-p 0.9 \
-no-cnv \
--prompt "<|header_start|>user<|header_end|>\n\nCreate a Flappy Bird game.<|eot|><|header_start|>assistant<|header_end|>\n\n"
Old 2/24 Q4_K_XL commit (pre bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7...
Questions for a postmortem that the blog post left unanswered:
- Why the change? Is it just to improve PPL/KLD? Sure, we can assume PPL and KLD are not perfect benchmarks. If yes, then why change the quantization anyways? Or was the old 2/24 quant actually much worse performing in the real world?I presume the Q4_K_XL quant using mxfp4 was the issue? If the 2/24 files having a lower PPL is an actual issue due to low quality tensors, then why not just say that?
- What were the main tensors that had the quantizations changed from 2/24 to 2/27? Did you now quantize attention tensors differently? Or perhaps ssm? T
- What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else?
A quick sentence in the blog post saying "ok, we've confirmed that using mxfp4 (or q3 or whatever) in the attention/ssm/biases/norms/etc is a bad idea, we had that in our old models on 2/24 and our new models today are better" that would make it clear. As it's written, it's trying to both say "PPL/KLD don't actually reflect real world quality" and "we changed our quant to increase PPL/KLD" at the same time, which seems contradictory.
Now shrinking them sure, but I’ve seen nothing that indicates you can just page weights in and out without cratering your performance like you would with a non MoE model
No your conclusion is false - only the old Q4_K_XL had slightly higher perplexity, all other quants are fine. We uploaded 9TB of research artifacts to https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-G... for the community.
If you read our blog, it says KLD and PPL are actually sometimes counterintuitive - for example MiniMax some of our quants do worse on PPL and KLD vs AesSedai's one for example, but does worse on LiveCodeBench by a lot see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-...
This is because see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-... - although bitwidths are in general monotonic ie q2_k < q3_k < q4_k < q5_k etc, we find KLD and PPL are actually not monotonic ie q3_k can actually have BETTER PPL than q4_k.
So the main point is bad luck on quantization - sometimes lower bits might get lower PPL and KLD, but actually this is a ruse and wrong, since on actual real world tasks, it's worse.
Unsloth Q4_K_M
PPL: 6.6053 KLD 99.9%: 0.5478 KLD mean: 0.0192
bartowski Qwen_Q4_K_M
PPL: 6.6097 KLD 99.9%: 0.5771 KLD mean: 0.0182
Barely noticeable drop in PPL; noticeable KLD drop (good, 5%); but worse KLD mean (bad, 5%).So then why was Q4_K_XL having issues? Is it just a PPL issue that doesn't reflect in real world usage? If yes, why not just say that? "The Q4_K_XL had lower PPL, but don't worry, PPL can be wrong, and other benchmarks show it's fine". If it was a real quality issue, then where was the issue caused by?
The blog post says "Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE" but doesn't say why. The easy assumption that most people would make is "oh, you quanted attention or ssn or something to mxfp4 and that turned out to be bad, so you retire mxfp4" but if you say that it's not that, then what's the actual issue?