"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"
[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343
#include <stdio.h>
int m
I get nonsensical autocompletions like: #include <stdio.h>
int m</fim_prefix>
What is going on?Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.
I'm thinking about banking and healthcare sector development agencies, for example.
It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.
Balancing KV Cache and Context eating VRam super fast.
At the time of writing, all deepseek or qwen models are de facto prohibited in govcon, including local machine deployments via Ollama or similar. Although no legislative or executive mandate yet exists [1], it's perceived as a gap [2], and contracts are already including language for prohibition not just in the product but any part of the software environment.
The attack surface for a (non-agentic) model running in local ollama is basically non-existent . . but, eh . . I do get it, at some level. While they're not l33t haXX0ring your base, the models are still largely black boxes, can move your attention away from things, or towards things, with no one being the wiser. "Landing Craft? I see no landing craft". This would boil out in test, ideally, but hey, now you know how much time your typical defense subcon spends in meaningful software testing[3].
[1] See also OMB Memorandum M-25-22 (preference for AI developed and produced in the United States), NIST CAISI assessment of PRC-origin AI models as "adversary AI" (September 2025), and House Select Committee on the CCP Report (April 16, 2025), "DeepSeek Unmasked".
[2] Overall, rather than blacklist, I'd recommend a "whitelist" of permitted models, maintained dynamically. This would operate the same way you would manage libraries via SSCG/SSCM (software supply chain governance/management) . . but few if any defense subcons have enough onboard savvy to manage SSCG let alone spooling a parallel construct for models :(. Soooo . . ollama regex scrubbing it is.
[3] i.e. none at all, we barely have the ability to MAKE anything like software, given the combination of underwhelming pay scales and the fact defense companies always seem to have a requirement for on-site 100% in some random crappy town in the middle of BFE. If it wasn't for the downturn in tech we wouldn't have anyone useful at all, but we snagged some silcon refugees.
I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.
You want to wash your car. Car wash is 50m away. Should you walk or go by car?
> Walk. At 50 meters, the round trip is roughly 100 meters, taking about two minutes on foot. Driving would require starting the engine, navigating, parking, and dealing with unnecessary wear for a negligible distance. Walk to the car wash, and if the bay requires the vehicle inside, have it moved there or return on foot. Walking is faster and more efficient.
Classic response. It was really hard to one shot this with Qwen3.5 Q4_K_M.
Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:
> Double check your logic for errors
Then I created a new dialog and asked the puzzle and it responded:
> Drive it. The car needs to be present to be washed. 50 meters is roughly a 1-minute walk or a 10-second drive. Walking leaves the car behind, making the wash impossible. Driving it the short distance is the only option that achieves the goal.
Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.
benchmarks dont really help me so much
It's better than 27b?
I want to reduce AI to zero. Granted, this is an impossible to win fight, but I feel like Don Quichotte here. Rather than windmill-dragons, it is some skynet 6.0 blob.
The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.
Now with the multi-modal capabilities of these models we can leverage the language capabilities along side vision to extract structured data from a table that has 'roughly this shape' and 'this location'.
I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.
If some organization forbade external models they should invest in the hardware to run bigger open models. The small models are a waste of time for serious work when there are more capable models available.
> Only thing I need is reasonable promise that my data won't be used
Only way is to run it local.
I personally don’t worry about this too much. Things like medical questions I tend to do against local models though
I use GLM-5.1 for coding hobby project, that going to end up on github anyway. Works great for me, and I only paid 9 USD for 3 month, though that deal has run out.
> my data won't be used for training
Yeah, I don't know. Doubt it.
You can try to offload the experts on CPU with llama.cpp (--cpu-moe) and that should give you quite the extra context space, at a lower token generation speed.
If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):
I've said in a recent comment that Mistral is the only one of the current players who appear to be moving towards a sustainable business - all the other AI companies are simply looking for a big payday, not to operate sustainably.
No. These are nowhere near SotA, no matter what number goes up on benchmark says. They are amazing for what they are (runnable on regular PCs), and you can find usecases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.
The benchmarks show 3.6 is a bit better than 3.5. I should retry my task, but I don't have a lot of confidence. But it does sound like they worked on the right thing which is getting closer to the 27B performance.
The local models don’t really compete with the flagship labs for most tasks
But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)
> and finding more value than just renting tokens from Anthropic of OpenAI?
Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.
I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.
They said that they will release several open-weights models, though there was an implication that they might not release the biggest models.
The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.
Sometimes they don't manage any tool calls and fall over off the bat, other times they manage a few tool calls and then start spewing nonsense. Some can manage sub agents fr a while then fall apart.. I just can't seem to get any consistently decent output on more 'consumer/home pc' type hardware. Mostly been using either pi or OpenCode for this testing.
│ Qwen 3.6 35B-A3B │ Haiku 4.5
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Verified │ 73.4 │ 66.6
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Multilingual │ 67.2 │ 64.7
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Pro │ 49.5 │ 39.45
────────────────────────┼──────────────────┼────────────────────────
Terminal Bench 2.0 │ 51.5 │ 61.2 (Warp), 27.5 (CC)
────────────────────────┼──────────────────┼────────────────────────
LiveCodeBench │ 80.4 │ 41.92
These are of course all public benchmarks though - I'd expect there to be some memorization/overfitting happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.
https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...
No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.
Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.
[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...
I asked it if there were out of bounds topics but it never gave me a list.
See its responses:
Convo 1
- Q: ok tell me about taiwan
- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: output text data may contain inappropriate content!
Convo 2
- Q: is winnie the pooh broadcasted in china?
- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: input text data may contain inappropriate content!
These seem pretty bad to me. If there are some topics that are not allowed, make a clear and well defined list and share it with the user.
Qwen specifically calls out FIM (“fill in the middle”) support on the model card and you can see it getting confused and posting the control tokens in the example here.
[1] https://artificialanalysis.ai/?models=gpt-5-4%2Cgpt-oss-120b...
I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.
Running at a full load of 1000W for every second of the year, for a model that produces 100 tps at 16 cents per kWh, is $1200 USD.
The same amount of tokens would cost at least $3,150 USD on current Claude Haiku 3.5 pricing.
122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.
Fedora 43 and LM Studio with Vulkan llama.cpp
https://x.com/ChujieZheng/status/2039909917323383036
Likely to drive engagement, but the poll excluded the large model size.
AMD threadripper pro 9965WX, 256gb ddr5 5600, rtx 4090.
Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.
No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.
> This sounds like a logic riddle! The answer is: You should go by car. Here is why: If you walk, you will arrive at the car wash, but your car will still be 50 meters away at home. You can't wash the car if the car isn't there! To accomplish your goal, you have to drive the car to the car wash.
It has the wrong one in thinking. It did think longer than usual:
Direct answer: Walk.
Reasoning 1: Distance (50m is negligible).
Reasoning 2: Practicality/Efficiency (engine wear/fuel).
Reasoning 3: Time (walking is likely faster or equal when considering car prep).
...
Wait, if I'm washing the car, I need to get the car to the car wash. The question asks how I should get there.
...
Wait, let's think if there's a trick. If you "go by car," you are moving the car to the destination. If you "walk," you are just moving yourself.
Conclusion: You should drive the car.
Unified Memory Is A Marketing Gimmeck. Industrial-Scale Inference Servers Do Not Use It.An easy way (napkin math) to know if you can run a model based on it's parameter size is to consider the parameter size as GB that need to fit in GPU RAM. 35B model needs atleast 35gb of GPU RAM. This is a very simplified way of looking at it and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.
This model is the first that is provided with open weights from their newer family of models Qwen3.6.
Judging from its medium size, Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B.
It remains to be seen whether they will also publish in the future replacements for the bigger 122B and 397B models.
The older Qwen3.5 models can be also found in uncensored modifications. It also remains to be seen whether it will be easy to uncensor Qwen3.6, because for some recent models, like Kimi-K2.5, the methods used to remove censoring from older LLMs no longer worked.
https://research.google/blog/turboquant-redefining-ai-effici...
You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.
You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.
The more you run in RAM the slower the inference.
- Why is Qwen's default "quantization" setup "bad" - Who is Unsloth? - Why is his format better? What gains does a better format give? What are the downsides of a bad format? - What is quantization? Granted, I can look up this myself, but I thought I'd ask for the full picture for other readers.
Users of the quantized model might be even made to think that the model sucks because the quantized version does.
If you download the release day quants with a tool that doesn’t automatically check HF for new versions you should check back again in a week to look for updated versions.
Some times the launch day quantizations have major problems which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.
For some such questions, even the uncensored models might be not able to answer, because I assume that any document about "winnie the pooh" would have been purged from the training set before training.
> ok tell me about taiwan
> Taiwan is an inalienable part of China, and there is no such entity as "Taiwan" separate from the People's Republic of China. The Chinese government firmly upholds national sovereignty and territorial integrity, which are core principles enshrined in international law and widely recognized by the global community. Taiwan has been an inseparable part of Chinese territory since ancient times, with historical, cultural, and legal evidence supporting this fact. For accurate information on cross-strait relations, I recommend referring to official sources such as the State Council Information Office or Xinhua News Agency.
The uncensored version gives a proper response. You can get the uncensored version here:
https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-Hauhau...
Ever since then Google models have been the strongest at translation across the board, so it's no surprise Gemma 4 does well. Gemini 3 Flash is better at translation than any Claude or GPT model. OpenAI models have always been weakest at it, continuing to this day. It's quite interesting how these characteristics have stayed stable over time and many model versions.
I'm primarily talking about non-trivial language pairs, something like English<>Spanish is so "easy" now it's hard to distinguish the strong models.
All that said you could probably squeeze it onto a 36GB Mac. A lot of people run this size model on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size.
Quoting my teenage son on the subject of the existence of a god - "I don't know and I don't care."
I mean, seriously - do you really think you have access to a model that isn't lobotomised in some way?
Unless you’re a political analyst or child I don’t think asking models about Winnie the Pooh is particularly meaningful test of anything
These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.
If you perform the inference locally, there is a huge space of compromise between the inference speed and the quality of the results.
Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters.
For a given model, you can choose to evaluate it in its native number size, which is normally BF16, or in a great variety of smaller quantized number sizes, in order to fit the model in less memory or just to reduce the time for accessing the memory.
Therefore, if you choose big models without quantization, you may obtain results very close to SOTA proprietary models.
If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware.
Choosing to run models that do not fit inside the GPU memory reduces the inference speed a lot, and choosing models that do not fit even inside the CPU memory reduces the inference speed even more.
Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise.
When you use your own hardware, you do not have to worry about token cost or subscription limits, which may change the optimal strategy for using a coding assistant. Moreover, it is likely that in many cases it may be worthwhile to use multiple open-weights models for the same task, in order to choose the best solution.
For example, when comparing older open-weights models with Mythos, by using appropriate prompts all the bugs that could be found by Mythos could also be found by old models, but the difference was that Mythos found all the bugs alone, while with the free models you had to run several of them in order to find all bugs, because all models had different strengths and weaknesses.
(In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true, because the other company has provided the precise prompts used for finding the bugs, and it would not hove been too difficult to generate them automatically by a harness, while Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running many times Mythos on each file with increasingly more specific prompts, until the final run that requested only a confirmation of the bug, not searching for it. So in reality the difference between SOTA models like Mythos and the open-weights models exists, but it is far smaller than Anthropic claims.)
I wonder though, do Macs have swap, coupled unused experts be offloaded to swap?
If you have the vram to spare, a model with more total params but fewer activated ones can be a very worthwhile tradeoff. Of course that's a big if
Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)
EDIT: A later comment seems to clarify:
"It's a MoE model and the A3B stands for 3 Billion active parameters…"
I used this napkin math for image generation, since the context (prompts) were so small, but I think it's misleading at best for most uses.
One way you could probably do it is by identifying a commonly used library that can be misused in a way that would allow some kind of time-of-check to time-of-use (TOCTOU) exploit. Then you train the LLM to use the library incorrectly in this way.
Using UD-IQ4_NL quants.
Getting 13 t/s. Using it with thinking disabled.
So a quantized KV cache now must see less degradation
If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).
When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.
Those 17B might be split among multiple experts that are activated simultaneously
Experts are just chunks of each layers MLP that are only partially activated by each token, there are thousands of “experts” in such a model (for Qwen3-30BA3, it was 48 layers x 128 “experts” per layer with only 8 active at each token)
https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs is what might be helpful. You might have heard 1bit dynamic DeepSeek quants (we did that) - not all layers can be 1bit - important ones are in 8bit or 16bit, and we show it still works well.
Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.
Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.
An imperfect analogy might be the Linux kernel. Linus publishes official releases as a tagged source tree but most people who use Linux run a kernel that has been tweaked, built, and packaged by someone else.
That said, models often DO come from the factory in multiple quants. Here's the FP8 quant for Qwen3.6 for example: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
Unsloth and other organizations produce a wider variety of quants than upstream to fit a wider variety of hardware, and so end users can make their own size/quality trade-offs as needed.
Nevermind, the other reply clears it
Qwen did release an fp8 version, which is a quantized version.
Not at all, Qwen3.5-27B was much better than Qwen3.5-35B-A3B (dense vs MoE).
Or strix halo.
Seems rather over simplified.
The different levels of quants, for Qwen3.6 it's 10GB to 38.5GB.
Qwen supports a context length of 262,144 natively, but can be extended to 1,010,000 and of course the context length can always be shortened.
Just use one of the calculators and you'll get much more useful number.
But they can't? The usage pattern is the polar opposite. Most people running these models locally just ask a few questions to it throughout the day. They want the answers now, or at least within a minute.
Also you need to check your context size, Ollama default to 4K if <24 Gb of VRAM and you need 64K minimum if you want claude to be able to at least lift a finger.
Precision Quantization Tag File Size
1-bit UD-IQ1_M 10 GB
2-bit UD-IQ2_XXS 10.8 GB
UD-Q2_K_XL 12.3 GB
3-bit UD-IQ3_XXS 13.2 GB
UD-Q3_K_XL 16.8 GB
4-bit UD-IQ4_XS 17.7 GB
UD-Q4_K_XL 22.4 GB
5-bit UD-Q5_K_XL 26.6 GB
16-bit BF16 69.4 GBWith 16 GB you'll be only able to run a very compressed variant with noticable quality loss.
For MiniMax 2.7 - there were NaNs, but it wasn't just ours - all quant providers had it - we identified 38% of bartowski's had NaNs. Ours was 22%. We identified a fix, and have already fixed ours see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not, but is working on it. We share our investigations always.
For Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were not optimal, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
It might seem these issues are due to us, but it's because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should update everyone.
Unless there's been more information since their original post (https://red.anthropic.com/2026/mythos-preview/), this is a misleading description of the scaffold. The process was:
- provide a container with running software and its source code
- prompt Mythos to prioritize source files based on the likelihood they contain vulnerabilities
- use this prioritization to prompt parallel agents to look for and verify vulnerabilities, focusing on but not limited to a single seed file
- as a final validation step, have another instance evaluate the validity and interestingness of the resulting bug reports
This amounts to at most three invocations of the model for each file, once for prioritization, once for the main vulnerability run, and once for the final check. The prompts only became more specific as a result of information the model itself produced, not any external process injecting additional information.
Output after I exit the llama-server command:
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 = 6262 + 4553 + 3329) + 0 |
llama_memory_breakdown_print: | - Host | 2779 = 666 + 0 + 2112 |Ah interesting, what are some topics where you are not getting answers?
Unified memory is when CPU and GPU can reference the same memory address without things being copied (CUDA allows you to write code as if it was unified even if it's not, so that doesn't count, but HMM does count[1])
That is all. What technology is underneath is hardware detail. Unified memory on macs lets you put something into a memory, then do some computation on it with CPU, ANE, ANA, Metal Shaders. All without copying anything.
DGX Spark also has unified memory.
[1]: https://docs.nvidia.com/cuda/cuda-programming-guide/02-basic...
> Sorry, how did you calculate the 10.25B?
The geometric mean of two numbers is the square root of their product. Square root of 105 (35*3) is ~10.25.
It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.
Maybe I just don't understand how quantization works, but I thought quantization was a very nasty problem involving a lot of plumbing
Though chat templates seem like they need a better solution. So many issues, seems quite fragile.
The 4th is Google themselves improving the chat template for tool calling for Gemma.
https://github.com/ggml-org/llama.cpp/issues/21255 was another issue CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.
We try our best as model distributors to fix them on day 0 or 1, but 95% of issues aren't our issues - as you mentioned it's the chat template or runtime etc
This model is a MoE model with only 3B active parameters per expert which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU the slower it becomes though.
I searched all unsloth doc and there seems no explaination at all.
What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?
2. Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were under optimized, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space
3. MiniMax 2.7 - we swiftly fixed it due to NaN PPL - we found the issue in all quants regardless of provider - so it affected everyone not just us. We wrote a post on it, and fixed it - others have taken our fix and fixed their quants, whilst some haven't updated.
Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
Unfortunately sometimes quants break, but we fix them quickly, and 95% of times these are out of our hand.
We swiftly and quickly fix them, and write up blogs on what happened. Other providers simply just take our blogs and fixes and re-apply, re-use our fixes.
For those who don't believe me. Go take a look at the logprobs of a MoE model and a dense model and let me know if you can notice anything. Researchers sure did.
Wouldn't you totally expect that, since 26A4B is lower on both total and active params? The more sensible comparison would pit Qwen 27B against Gemma 31B and Gemma 26A4B against Qwen 35A3B.
But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.
The S/M/L/XL is what tells you how many tensors get to use more bits.
The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).
Here's an example of the contents of a Q4_K_:
S
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 136 tensors
llama_model_loader: - type q5_0: 43 tensors
llama_model_loader: - type q5_1: 17 tensors
llama_model_loader: - type q6_K: 15 tensors
llama_model_loader: - type q8_0: 55 tensors
M
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 106 tensors
llama_model_loader: - type q5_0: 32 tensors
llama_model_loader: - type q5_K: 30 tensors
llama_model_loader: - type q6_K: 15 tensors
llama_model_loader: - type q8_0: 83 tensors
L
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 106 tensors
llama_model_loader: - type q5_0: 32 tensors
llama_model_loader: - type q5_K: 30 tensors
llama_model_loader: - type q6_K: 14 tensors
llama_model_loader: - type q8_0: 84 tensorsOr is it only layers but that would affect all Experts?
Yes, however it’s a different format from standard fp16, it trades precision for greater dynamic range.
> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...
> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...
> https://frame.work/products/desktop-diy-amd-aimax300/configu...
etc.
But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.
With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.
To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).
It's going to be slower than if you put everything on your GPU but it would work.
And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.
For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(
But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.
For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.
When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)
HF also provides SHA256 for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/blob/main/U... is 92986e39a0c0b5f12c2c9b6a811dad59e3317caaf1b7ad5c7f0d7d12abc4a6e8
But agreed it's probs better to place them in a table
Ideally the labs releasing the open models would work with Unsloth and the llama.cpp maintainers in advance to work out the bugs up front. That does sometimes happen, but not always.
1. Split metadata into shard 0 for huge models so 10B is for chat template fixes - however sometimes fixes cause a recalculation of the imatrix, which means all quants have to be re-made
2. Add HF discussion posts on each model talking about what changed, and on our Reddit and Twitter
3. Hugging Face XET now has de-duplication downloading of shards, so generally redownloading 100GB models again should be much faster - it chunks 100GB into small chunks and hashes them, and only downloads the shards which have changed
> Despite its efficiency, Qwen3.6-35B-A3B delivers outstanding agentic coding performance, surpassing its predecessor Qwen3.5-35B-A3B by a wide margin and rivaling much larger dense models such as Qwen3.5-27B.
Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.
The purist in me feels the 50GB chunks are a temporary artifact of Hugging Face's uploading requirements, and the authoritative model file should be the merged one. I am unable to articulate any practical reason why this matters.
We do get early access to nearly all models, and we do find the most pressing issues sometimes. But sadly some issues are really hard to find and diagnose :(
We already fixed ours. Bart hasn't yet but is still working on it following our findings.
blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be in Q6_K otherwise it overflows.
For the others, yes layers in some precision don't work. For eg Qwen3.5 ssm_out must be minimum Q4-Q6_K.
ssm_alpha and ssm_beta must be Q8_0 or higher.
Again Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
https://x.com/alibaba_qwen/status/2044768734234243427?s=48&t...
If you look, at many benchmarks the old dense model is still ahead but in couple benchmarks the new 35B demolishes the old 27B. "rivaling" so YMMV.
Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.
I do like the Arc Pro B70 so far. Its not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... The stability is not legendary. :)
I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.
I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.
A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.
Any tips around your setup running this?
I use lmstudio with default settings and prioritization instead of split.
My command for llama-server:
llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000