At least this one gave credit to the upstream projects which it used as a reference.
The llama.cpp project is also getting a wave of vibecoded PRs that are very clearly being produced by pointing Claude at the repo and the original paper and having it produce something.
Almost none of these attempts contain the information that really matters, like actual benchmark results at different KV quantization levels (not just perplexity or KLD).
Going from paper to implementation from scratch in half an hour or so is great.
TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.
SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.
By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.
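The streaming design described above can be sketched with a plain mmap: the weight file stays on disk, only the byte ranges for the routed experts are touched per pass, and the OS page cache keeps hot experts resident. A minimal toy sketch (file layout, sizes, and all names here are invented for illustration, not SwiftLM's actual code):

```python
# Hypothetical sketch of SSD expert streaming: the full weight file stays on
# disk; only the pages for the top-k routed experts are read per forward pass.
import mmap, os, tempfile

EXPERTS, EXPERT_BYTES = 8, 4096  # toy sizes; real experts are hundreds of MB

# Build a toy weight file: EXPERTS contiguous expert blocks.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
with open(path, "wb") as f:
    for e in range(EXPERTS):
        f.write(bytes([e]) * EXPERT_BYTES)

def load_expert(mm, idx):
    """Slice out one expert's block; untouched experts never leave the SSD."""
    off = idx * EXPERT_BYTES
    return mm[off:off + EXPERT_BYTES]

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    top_k = [2, 5]                      # router output for this forward pass
    active = [load_expert(mm, e) for e in top_k]
    # Only these two blocks were faulted in; the OS caches them for reuse.
    assert all(blk == bytes([e]) * EXPERT_BYTES for e, blk in zip(top_k, active))
```

The real work is in the GPU-side plumbing (getting those pages into VRAM at ~9 GB/s), but the "let the page cache handle hot-expert reuse" part really is this simple.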
Also tested Qwen 4B on an iPhone 13 Pro.
Code and implementation details: https://github.com/SharpAI/SwiftLM
Of course large corps will have fancy proprietary models, but for everyday queries and tasks, local feels like a huge win, and just slightly out of reach.
Am I missing something fundamental?
Llama.cpp already has KV compression and one of the turbo quant PRs will get merged at some point.
If you don’t care about the fancy 3-bit, the q8 KV compression is good enough! Don’t bother with q4.
./build/bin/llama-server -m model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 65536
Etc
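To see why q8 KV matters at that context length, some back-of-envelope sizing. The dims here (32 layers, 8 KV heads, head dim 128) are illustrative, not any specific model:

```python
# KV cache size: bytes = 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes/elt
layers, ctx, kv_heads, head_dim = 32, 65536, 8, 128

def kv_cache_gib(bytes_per_elt):
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_elt / 2**30

f16 = kv_cache_gib(2.0)     # fp16 baseline: 2 bytes per element
q8  = kv_cache_gib(1.0625)  # q8_0: 1 byte + a 2-byte fp16 scale per 32-elt block
print(f"f16: {f16:.1f} GiB, q8_0: {q8:.1f} GiB ({f16/q8:.2f}x smaller)")
```

At these dims that's 8 GiB of KV in fp16 versus 4.25 GiB in q8_0, which is often the difference between fitting a long context or not.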
./SwiftLM \
--model mlx-community/Qwen3.5-122B-A10B-4bit \
--stream-experts \
--port 5413
Error: [SwiftLM] Loading model: mlx-community/Qwen3.5-122B-A10B-4bit
[SwiftLM] Enabled Async SSD Streaming on directory: e9c67b08899964be5fdd069bb1b4bc8907fe68f5
[SwiftLM] Memory strategy: FULL GPU (69.6GB model, 133.4GB available)
[SwiftLM] Download: [===================>] 100% ⠋ (66395.4 MB / 66395.4 MB) | Speed: 0.0 MB/s
MLX error: Failed to load the default metallib. library not found library not found library not found library not found at /Users/runner/work/SwiftLM/SwiftLM/LocalPackages/mlx-swift/Source/Cmlx/mlx-c/mlx/c/stream.cpp:115

One thing I've turned up in smaller models, and am sort of winding my way toward verifying in larger ones, is that if you train the MoE model from scratch with this kind of knockout / subset of experts baked in, then you get significantly better loss outcomes. In small models, it's actually better than training an MoE without conditioning on a reduced set of experts per pass.
Anyway, pretty cool. There's some Pareto-optimal curve based on memory bandwidth, amount of GPU / unified RAM and inference compute times for streaming stuff in.
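The "knockout" idea above can be illustrated with a toy router that only routes among a random surviving subset of experts each pass. This is purely hypothetical code sketching the commenter's described setup, not their actual training loop:

```python
# Toy "expert knockout" routing: per forward pass, restrict top-k selection to
# a random 'alive' subset of experts, so training matches the reduced-expert
# condition seen when experts are dropped or streamed at inference time.
import random

def route(scores, alive_frac=0.5, top_k=2, rng=random):
    """Pick top_k experts by score, but only among a random alive subset."""
    n = len(scores)
    alive = rng.sample(range(n), max(top_k, int(n * alive_frac)))
    return sorted(alive, key=lambda e: scores[e], reverse=True)[:top_k]

rng = random.Random(0)
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]  # router logits for 8 experts
picked = route(scores, rng=rng)
assert len(picked) == 2 and all(0 <= e < len(scores) for e in picked)
```

In a real training run the masking would happen on the router logits before the softmax, but the effect is the same: the model never learns to depend on always having every expert available.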
It's not inherently bad in the same way that a first draft of a novel is not inherently bad.
But if someone asked me to read their novel and it was a first draft that they themselves had clearly not bothered reading or editing, I'd tell them to fuck off.
This repo isn’t showing that at all. Scroll to the bottom of the README and you’ll see the other project it was based on. It’s a translation of other people’s work.
There have been dozens or perhaps hundreds of vibecoded TurboQuant examples posted around the usual forums in the past few days. This one doesn’t even include anything helpful like benchmarks or tests. It’s just some proof of concept code that doesn’t even work if you try to run it.
My problem with this specific type of vibe coded project is that it’s initially presented as something more novel or polished in order to get more upvotes, karma, likes, or pad a resume. Then you read it and discover they just pointed Claude at some other projects and told it to produce something similar, then posted it as their own work.
Software is valuable if it has been tested and exercised properly by other people. I don't care if you vibe coded it, provided you then put the real work in to verify that it actually works correctly - and then include the proof that you've done that when you start widely sharing it with the world.
Right now it's impossible to tell which of these projects implementing the paper are worth spending time with.
Where’s the value added if the person just tells Claude to do it and then submits a PR?
The maintainers may as well vibe code it themselves if that’s all the work the would-be contributor is going to put into it.
git clone --recursive https://github.com/SharpAI/SwiftLM.git
cd SwiftLM
swift build -c release
# Trick to copy in that missing mlx.metallib file
uv run --with mlx-metal python -c "
import importlib.metadata, pathlib, shutil
d = importlib.metadata.distribution('mlx-metal')
metallib = pathlib.Path(d._path).parent / 'mlx/lib/mlx.metallib'
shutil.copy(metallib, '.build/release/')
print(f'Copied {metallib} -> .build/release/mlx.metallib')
"
# Now start the server (downloads 69GB Qwen model)
.build/release/SwiftLM \
--model mlx-community/Qwen3.5-122B-A10B-4bit \
--stream-experts \
--port 5413
But the server crashed when I tried to run a prompt through it: freed pointer was not the last allocation

A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.
No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.
OpenAI-compatible endpoints (/v1/chat/completions, streaming, etc.). GPU layer limiting (--gpu-layers) and Wisdom Auto-Calibration for squeezing massive models into RAM.

SwiftLM implements a hybrid V2+V3 TurboQuant architecture for on-the-fly KV cache compression. At roughly ~3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs FP16 with near-zero accuracy loss.
Recent reproductions of the TurboQuant algorithm (e.g., turboquant-mlx) revealed two distinct paths:
We built the "Holy Grail" hybrid: we ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path and perform the dequantization in fused Metal (ggml-metal) shaders. This achieves V3 quality at V2 speeds, completely detached from Python overhead.
K-Cache (3-bit PolarQuant + 1-bit QJL) = 4.25 bits/dim
x̂ = x / ‖x‖

V-Cache (3-bit PolarQuant) = 3.125 bits/dim

Because the V-cache matrix is not used for inner-product attention scoring, the QJL error correction provides no benefit. We cleanly disable QJL for the V-cache, extracting an additional 25% memory savings without sacrificing quality.
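A quick sanity check of the bit accounting above, treating the fractional bits as per-block metadata such as scales and norms (my assumption; the README doesn't spell this out) and assuming K and V caches are the same size:

```python
# Reconciling the per-cache bit figures with the headline compression ratio.
k_bits = 3 + 1 + 0.25   # 3-bit PolarQuant + 1-bit QJL + overhead = 4.25 bits/dim
v_bits = 3 + 0.125      # 3-bit PolarQuant + overhead = 3.125 bits/dim (QJL off)
avg    = (k_bits + v_bits) / 2
ratio  = 16 / avg       # vs FP16 at 16 bits/dim
print(f"avg {avg:.2f} bits/dim -> {ratio:.1f}x vs FP16")
```

The average comes out to about 3.7 bits/dim and roughly a 4.3× ratio against FP16, which lines up with the 4.3× figure quoted upthread (the README's "~3.6 bits / ~3.5×" presumably folds in further metadata).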
Reference implementations: turboquant-mlx | turboquant_plus | Paper: TurboQuant, arXiv:2504.19874
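The Lloyd-Max codebooks mentioned above come from Lloyd's algorithm, which in one dimension is just k-means on scalar samples: partition by nearest centroid, move each centroid to its cell's mean, repeat. A minimal sketch on N(0, 1) samples (not the repo's actual codebook construction):

```python
# 1-D Lloyd-Max codebook construction via Lloyd's algorithm (scalar k-means).
import random, statistics

def lloyd_max(samples, levels=8, iters=30):
    samples = sorted(samples)
    # Initialize centroids at evenly spaced sample quantiles.
    cb = [samples[int((i + 0.5) * len(samples) / levels)] for i in range(levels)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid...
        cells = [[] for _ in range(levels)]
        for x in samples:
            cells[min(range(levels), key=lambda i: abs(x - cb[i]))].append(x)
        # ...then move each centroid to the mean of its cell.
        cb = [statistics.fmean(c) if c else cb[i] for i, c in enumerate(cells)]
    return sorted(cb)

rng = random.Random(0)
cb = lloyd_max([rng.gauss(0, 1) for _ in range(4000)], levels=8)
# For a zero-mean Gaussian the codebook comes out roughly symmetric about 0.
assert cb == sorted(cb) and abs(sum(cb)) < 0.5
```

For a known distribution like N(0, 1) the codebook can be computed once offline and baked in as a table, which is exactly what makes porting it to C++/Metal cheap.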
To reliably run massive 122B parameter MoE models over SSD streaming, SwiftLM was designed and benchmarked natively on the following hardware:
⚠️ Quantization Disclaimer: While heavier quantization shrinks the required memory footprint, 4-bit quantization remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like
`\name\` instead of `"name"`—which systematically breaks OpenAI-compatible tool calling.
A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.
mlx-community model by name

cd SwiftLMChat
python3 generate_xcodeproj.py # Generates SwiftLMChat.xcodeproj
open SwiftLMChat.xcodeproj
Then in Xcode:
Note for contributors: The .xcodeproj is git-ignored (it contains your personal Team ID). Run generate_xcodeproj.py after cloning to regenerate it locally. Your Team ID is never committed.
Download the latest release tarball from the Releases page.
The archive is self-contained — default.metallib is bundled alongside the binary.
tar -xzf SwiftLM-<version>-macos-arm64.tar.gz
# Run from the extracted directory — default.metallib must be co-located with the binary
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
⚠️ Metal GPU Error? If you see
Failed to load the default metallib, it means default.metallib is missing from the directory you are running SwiftLM from. Make sure you run the binary from the extracted folder and do not move the binary without also moving default.metallib alongside it.
# Must clone recursively — default.metallib ships inside the mlx-swift submodule
git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
swift build -c release
default.metallib is a pre-built artifact inside the mlx-swift submodule, version-matched to the Swift binary. Copy it next to the binary before running:
cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
.build/release/
.build/release/SwiftLM \
--model mlx-community/Qwen3.5-122B-A10B-4bit \
--stream-experts \
--port 5413
⚠️ Do NOT use Python's
mlx-metal package as a source for mlx.metallib.
While `uv run --with mlx-metal python -c "...shutil.copy(metallib, ...)"` will get the server to start, the pip mlx-metal package is a different version of MLX than what this binary was compiled against. The version mismatch causes GPU kernel ABI corruption during inference, producing a freed pointer was not the last allocation crash. Always use the metallib from LocalPackages/mlx-swift/ — it is the only version-matched artifact for this build.
(Add --stream-experts when running oversized MoE models like Qwen3.5 122B to bypass macOS virtual memory swapping and stream expert layers directly from NVMe.)
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Server health + loaded model capabilities |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat completions (LLM and VLM support, multi-turn, system prompts) |
Drop-in compatible with standard OpenAI HTTP consumers:
curl http://localhost:5413/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-122B-A10B-4bit",
"stream": true,
"messages": [
{"role": "system", "content": "You are Aegis-AI, a local home security agent. Output strictly in JSON format."},
{"role": "user", "content": "Clip 1: Delivery person drops package at 14:02. Clip 2: Delivery person walks away down driveway at 14:03. Do these clips represent the same security event? Output a JSON object with a `duplicate` boolean and a `reason` string."}
]
}'
| Option | Default | Description |
|---|---|---|
| --model | (required) | HuggingFace model ID or local path |
| --port | 5413 | Port to listen on |
| --host | 127.0.0.1 | Host to bind |
| --max-tokens | 2048 | Max tokens limit per generation |
| --gpu-layers | model_default | Restrict the number of layers allocated to GPU hardware |
| --stream-experts | false | Enable experimental SSD streaming for MoE model expert matrices |
xcodebuild -downloadComponent MetalToolchain)

Built entirely on the hard work of the Apple MLX community.
The TurboQuant KV cache compression implemented in SwiftLM is directly based on the following open-source work and research:
TheTom/llama-cpp-turboquant — The primary reference for the C and Metal GPU implementation. The turbo-wht.h Fast Walsh-Hadamard kernel, WHT sign arrays (seed=42), Lloyd-Max centroid tables, and the ggml-turbo-quant.c quantize/dequantize logic were ported directly from this repository into our MLX C++ and Metal backend.
TheTom/turboquant_plus — Python reference implementation used to validate the algorithm math, codebook construction (Lloyd's algorithm for N(0, 1/d)), and KV cache integration design.
TurboQuant Paper — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", Zandieh et al., AISTATS/ICLR 2026. The two-stage PolarQuant + QJL algorithm described in Section 3 and Appendix A is the mathematical foundation of this implementation.
amirzandieh/QJL — Original Quantized Johnson-Lindenstrauss (QJL) 1-bit residual correction implementation by the paper authors.
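The Fast Walsh-Hadamard transform credited above (turbo-wht.h) is, at its core, the classic O(n log n) butterfly used to spread coordinate energy before quantization. A generic textbook sketch, not the ported kernel:

```python
# In-place, unnormalized Fast Walsh-Hadamard transform over power-of-two lengths.
def fwht(v):
    n, h = len(v), 1
    assert n and n & (n - 1) == 0, "length must be a power of two"
    while h < n:
        # Butterfly: combine pairs (j, j+h) within each block of size 2h.
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v

# A delta input spreads to a flat vector (row 0 of the Hadamard matrix).
assert fwht([1.0, 0.0, 0.0, 0.0]) == [1.0, 1.0, 1.0, 1.0]
# Applying it twice recovers n * x, since H*H = n*I (orthogonal up to scaling).
assert fwht(fwht([3.0, -1.0, 2.0, 5.0])) == [12.0, -4.0, 8.0, 20.0]
```

Production kernels additionally fix a sign pattern (the "WHT sign arrays (seed=42)" mentioned above) so the rotation is a pseudo-random orthogonal transform rather than the bare Hadamard matrix.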
MIT License
These are more like sending someone a LMGTFY link they didn't ask for and expecting them to read all the results. Just a complete lack of awareness of, and respect for, the maintainers.
# Copy metallib next to the binary (one-time step)
cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
  .build/release/
We live in a wholly unoptimized world because available resources have been so abundant while the benefits of optimizing have been so low. That has flipped now, and there is plenty of low-hanging fruit to optimize.
I agree that benchmarks would be great, but that's only relevant to this one topic, not to the overall concept of agent-coded pull requests itself.
Use the version-matched metallib that's already in the repo:
cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
  .build/release/

.build/release/SwiftLM \
  --model mlx-community/Qwen3.5-122B-A10B-4bit \
  --stream-experts \
  --port 5413

This is the exact metallib that was compiled alongside the Swift code — no version mismatch. Future pre-built releases will bundle it automatically.
A PR without evidence that it works, or any statement of the benefits the new feature would bring, is kind of worthless.
If it works in one case, that doesn't mean it works consistently or well in the general case.
I've made lots of things with Claude Code that just work... until I do things in a slightly different order and the whole thing explodes