You do not explain how any kind of predictor can work for MoE experts.
You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).
Come on, "Run" is not the right word. "Crawl" is.
Headlines like that are misleading.
For a 1T model you'd need to stream something like 2 TB of weights per forward pass at fp16. Even at peak sequential read speed that's 300+ seconds per token, which is... not great for interactive use, but maybe fine for batch inference where you don't care about latency.
Still a cool proof of concept, though. The gap between 'can run' and 'runs usefully' is where things get interesting.
What makes this approach faster is that the model's access pattern is completely deterministic during
inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So
you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal.
The OS page cache can't do that: it has no concept of "layer N+1 comes after layer N."
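The sequential-prefetch idea can be sketched in toy form: because layer order is known ahead of time, a loader thread can stay exactly one layer ahead of compute. This is an illustration with stand-in load/compute functions, not Hypura's actual code.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a large sequential NVMe read of one layer's weights.
fn load_layer(n: usize) -> Vec<u8> {
    vec![n as u8; 4]
}

// Stand-in for running the layer on the GPU.
fn compute_layer(weights: &[u8], acc: u64) -> u64 {
    acc + weights.iter().map(|&b| b as u64).sum::<u64>()
}

fn run_pipeline(num_layers: usize) -> u64 {
    // Depth-1 prefetch queue: the loader blocks once it is one layer ahead.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(1);
    // The loader needs no hints from compute: layer order is deterministic,
    // so it simply reads layers 0..N in sequence.
    let loader = thread::spawn(move || {
        for n in 0..num_layers {
            tx.send(load_layer(n)).unwrap();
        }
    });
    let mut acc = 0u64;
    for _ in 0..num_layers {
        // By the time we ask for layer N, it was loaded while N-1 computed.
        let weights = rx.recv().unwrap();
        acc = compute_layer(&weights, acc);
    }
    loader.join().unwrap();
    acc
}
```

The OS page cache cannot replicate this because it only sees opaque page faults, never the "layer N+1 follows layer N" schedule.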
For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one,
then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than
expert 7. The neuron cache here is basically a domain-specific replacement policy.

I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
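The frequency-aware replacement policy described above can be sketched like this: evict the least frequently fired expert rather than the least recently used one, so a hot expert is never pushed out by a burst of cold ones. This is a toy illustration, not the actual neuron-cache implementation.

```rust
use std::collections::HashMap;

// Toy expert cache keyed by expert id, with per-expert fire counts.
struct ExpertCache {
    capacity: usize,
    resident: HashMap<u32, Vec<u8>>, // expert id -> weights
    fire_counts: HashMap<u32, u64>,
}

impl ExpertCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, resident: HashMap::new(), fire_counts: HashMap::new() }
    }

    /// Returns true on a cache hit, false when the expert had to be streamed.
    fn access(&mut self, expert: u32, load: impl Fn(u32) -> Vec<u8>) -> bool {
        *self.fire_counts.entry(expert).or_insert(0) += 1;
        if self.resident.contains_key(&expert) {
            return true; // hit: no NVMe read needed
        }
        if self.resident.len() >= self.capacity {
            // Evict the resident expert with the lowest fire count —
            // unlike LRU, this knows expert 3 fires far more than expert 7.
            let coldest = *self
                .resident
                .keys()
                .min_by_key(|id| self.fire_counts.get(*id).copied().unwrap_or(0))
                .unwrap();
            self.resident.remove(&coldest);
        }
        self.resident.insert(expert, load(expert));
        false // miss: streamed from NVMe
    }
}
```

With capacity 2 and a routing pattern that hammers expert 3, a rarely-fired expert gets evicted while the hot one stays resident.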
Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem that each individual expert-layer is quite small (a few MiB each, give or take), but those reads are large enough for the NVMe.
 _   _
| | | |_   _ _ __  _   _ _ __ __ _
| |_| | | | | '_ \| | | | '__/ _` |
|  _  | |_| | |_) | |_| | | | (_| |
|_| |_|\__, | .__/ \__,_|_|  \__,_|
       |___/|_|
Run models too big for your Mac's memory
Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities, enabling models that exceed physical memory to run without crashing the system.
Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.
Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model: the OS will swap-thrash until the OOM killer intervenes.
Hypura solves this by understanding the model architecture.
The result: models that would crash your machine under naive mmap become runnable. Models that fit in memory run at full Metal GPU speed with zero overhead.
Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:
- GPU tier: tensors resident in Metal buffers, capped by the device's `recommendedMaxWorkingSetSize`.
- NVMe tier: tensors streamed with uncached reads (`F_NOCACHE` + `pread`), prefetched ahead of the forward pass.

Hypura selects the best inference mode automatically based on model size, architecture, and available memory: full-resident, expert-streaming, or dense-FFN-streaming.
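A toy sketch of the greedy half of such a placement pass (the README mentions an LP as well; this shows only a greedy fill): sort tensors by access frequency per byte, fill the GPU budget first, then RAM, and spill the rest to NVMe. Names, numbers, and the scoring formula here are illustrative assumptions, not Hypura's actual heuristic.

```rust
#[derive(Debug, PartialEq)]
enum Tier { Gpu, Ram, Nvme }

struct Tensor {
    name: &'static str,
    bytes: u64,
    accesses_per_token: f64, // norms fire every token; experts fire sparsely
}

fn place(tensors: &[Tensor], gpu_budget: u64, ram_budget: u64) -> Vec<(&'static str, Tier)> {
    let mut order: Vec<&Tensor> = tensors.iter().collect();
    // Hottest bytes first: accesses per token divided by size.
    order.sort_by(|a, b| {
        let ka = a.accesses_per_token / a.bytes as f64;
        let kb = b.accesses_per_token / b.bytes as f64;
        kb.partial_cmp(&ka).unwrap()
    });
    let (mut gpu_left, mut ram_left) = (gpu_budget, ram_budget);
    order
        .iter()
        .map(|t| {
            let tier = if t.bytes <= gpu_left {
                gpu_left -= t.bytes;
                Tier::Gpu
            } else if t.bytes <= ram_left {
                ram_left -= t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme // streamed at inference time
            };
            (t.name, tier)
        })
        .collect()
}
```

Small, every-token tensors (norms, attention) win the GPU budget; large, rarely-touched expert weights land on NVMe, which is exactly the split the benchmark table below reflects.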
Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile; no manual tuning needed.
All benchmarks on M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read.
| Model | Size | GPU | NVMe | Mode | Hypura | llama.cpp | Notes |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4 GB | 8.4 GB | – | full-resident | 21 tok/s | ~21 tok/s | Fits in GPU; no overhead |
| Mixtral 8x7B Q5_K_M | 30.9 GB | 1.1 GB | 29.8 GB | expert-streaming | 2.2 tok/s | OOM | All layers on Metal; 99.5% cache hit rate |
| Llama 3.3 70B Q4_K_M | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | 0.3 tok/s | OOM | All layers on Metal; dynamic 24-slot pool, 7-layer prefetch |
Key takeaway: For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token). Dense FFN-streaming extends this to non-MoE models like Llama 70B. Pool sizes and prefetch depth scale automatically with available memory.
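The arithmetic behind "usable interactive speeds" can be made concrete. With MoE sparsity, only the experts that actually fire are streamed, and a high cache hit rate shrinks that further. The numbers below are illustrative, not Hypura's measured figures:

```rust
// Bytes streamed from NVMe per generated token, under MoE sparsity plus a
// neuron cache. All inputs are illustrative.
fn streamed_bytes_per_token(
    expert_bytes: u64,   // size of one expert's weights
    layers: u64,
    experts_fired: u64,  // e.g. 2 of 8 experts per token for Mixtral
    cache_hit_rate: f64, // fraction of expert reads served from the cache
) -> f64 {
    (expert_bytes * layers * experts_fired) as f64 * (1.0 - cache_hit_rate)
}

// Upper bound on decode speed when NVMe bandwidth is the bottleneck.
fn tokens_per_second(bytes_per_token: f64, nvme_bytes_per_sec: f64) -> f64 {
    nvme_bytes_per_sec / bytes_per_token
}
```

A dense model must stream its whole FFN every token, so the same formula with `experts_fired` equal to the full expert count (and a far lower hit rate) explains why Llama 70B decodes an order of magnitude slower than Mixtral in the table above.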
Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake (for the vendored llama.cpp).
git clone --recurse-submodules https://github.com/hypura/hypura.git
cd hypura
cargo build --release
The binary is at target/release/hypura.
Homebrew tap coming soon.
# Profile your hardware (runs once, cached)
hypura profile
# Run inference on a GGUF model
hypura run ./model.gguf --prompt "Hello, world"
# Interactive chat
hypura run ./model.gguf --interactive
# Benchmark: Hypura scheduling vs naive baseline
hypura bench ./model.gguf
# Inspect model placement plan without loading
hypura inspect ./model.gguf
Start with --max-tokens 10 on untested models before scaling up.
Hypura exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama, including OpenClaw.
hypura serve ./model.gguf
# Hypura serving Mixtral 8x7B Instruct v0.1
# Endpoint: http://127.0.0.1:8080
# Ollama-compatible API: /api/generate, /api/chat, /api/tags
| Endpoint | Description |
|---|---|
| `GET /` | Health check |
| `GET /api/tags` | List loaded model |
| `GET /api/version` | Server version |
| `POST /api/show` | Model metadata |
| `POST /api/generate` | Text completion (streaming NDJSON or single response) |
| `POST /api/chat` | Chat completion (streaming NDJSON or single response) |
Point OpenClaw at Hypura by setting the Ollama base URL in ~/.openclaw/openclaw.json:
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:8080",
"api": "ollama"
}
}
}
}
Or via the CLI:
openclaw config set models.providers.ollama.baseUrl "http://127.0.0.1:8080"
Hypura speaks native Ollama protocol (/api/chat with NDJSON streaming), so no compatibility shims are needed.
hypura serve <MODEL> [OPTIONS]
Options:
--host <HOST> Host to bind to [default: 127.0.0.1]
--port <PORT> Port to bind to [default: 8080]
--context <N> Maximum context length [default: 4096]
Hypura is a Cargo workspace with two crates:
- `hypura`: Main binary and library. CLI in `src/main.rs`, all logic in `src/lib.rs` modules.
- `hypura-sys`: FFI bindings to llama.cpp (vendored at `vendor/llama.cpp/`, built via CMake).

| Module | Purpose |
|---|---|
| `scheduler/placement.rs` | LP + greedy tensor placement across GPU/RAM/NVMe tiers |
| `compute/inference.rs` | Inference engine: `generate_blocking`, `generate_with_nvme_scheduling`, server-oriented `load_model` / `generate_from_loaded` |
| `compute/nvme_backend.rs` | Custom GGML buffer type, pool-based expert/FFN streaming, neuron cache, eval callback |
| `server/routes.rs` | Axum HTTP handlers for Ollama-compatible API |
| `profiler/` | Hardware detection (CPU, GPU, memory bandwidth, NVMe throughput) |
| `cli/bench.rs` | A/B benchmark harness |
| `model/tensor_role.rs` | Tensor classification for placement scoring (norms, attention, MoE experts) |
No. Hypura only reads from your SSD during inference; it never writes to it.
SSD wear is caused by write cycles (program/erase cycles on NAND flash cells). Reads do not degrade flash cells. Hypura's entire NVMe I/O path uses read-only pread() calls with F_NOCACHE to stream tensor weights from the GGUF file into RAM/GPU memory pools, where all computation happens. The SSD is used as cold storage, not as working memory.
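The read-only I/O path boils down to positioned reads against the GGUF file. A minimal sketch, assuming only the standard library (the real path additionally sets `F_NOCACHE` via `fcntl` on macOS to bypass the page cache; that flag is omitted here for portability):

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;

// Read one tensor's byte range straight out of the weights file.
// read_exact_at is a positioned read (pread(2) under the hood), so no seek
// state is shared between threads, and nothing here can ever write.
fn read_tensor(file: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```

Because the file is opened read-only and only `pread`-style calls are issued, no program/erase cycles occur, which is why inference causes no flash wear.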
The only writes Hypura performs are negligible: benchmark result JSON files (KB), co-activation statistics (KB to ~/.hypura/), and the one-time hypura optimize command if you choose to run it. Normal inference generates zero SSD writes.
- `bench --baseline` is blocked when the model exceeds RAM minus 4 GB headroom. Use `--force` to override at your own risk.
- Start with `--max-tokens 10` on untested models.
- Test models go in `./test-models/` (not checked in).

MIT
I feel morally obligated to say I did not write the code in this repository myself. This project is an exploration of using LLMs to carry out tasks based on my direction. The majority of prompts I used to get here were derived using the Socratic method, genuine curiosity, and a hunch that NVMe-backed inference is underutilized despite NVMe being a (slow but) perfectly valid form of memory.
Still, couldn't one get a RAID 0 card with four drives to saturate an x16 link? That's already the max one could push through PCIe anyhow.
man 2 madvise
Joke aside (I do have one, though!), I don't think Optane is that much use (not to mention my unit is only 256 GiB). It is a useful legacy crutch if you have legacy software that is not designed to issue multiple reads/writes in parallel. If you do, it is really not faster than NVMe, especially the modern drives.
"overloading NVMe"? What is that about? First time I've heard anything about it.
> because putting a ton of stress on your NVMe during generation
Really shouldn't "stress your NVMe", something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations "hurt" the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.
There is no writing to SSDs on inference with this architecture.