You do not explain how any kind of predictor can work for MoE experts.
You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).
Come on, "Run" is not the right word. "Crawl" is.
Headlines like that are misleading.
For a 1T model you'd need to stream something like 2 TB of weights per forward pass at fp16. Even at peak sequential read speed that's 300+ seconds per token, which is... not great for interactive use, but maybe fine for batch inference where you don't care about latency.
Still a cool proof of concept, though. The gap between 'can run' and 'runs usefully' is where things get interesting.
What makes this approach faster is that the model's access pattern is completely deterministic during
inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So
you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal.
The OS page cache can't do that: it has no concept of "layer N+1 comes after layer N."
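The sequential-prefetch idea can be sketched in toy form: because layer order is known ahead of time, a loader thread can stay exactly one layer ahead of compute. This is an illustration with stand-in load/compute functions, not Hypura's actual code.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a large sequential NVMe read of one layer's weights.
fn load_layer(n: usize) -> Vec<u8> {
    vec![n as u8; 4]
}

// Stand-in for running the layer on the GPU.
fn compute_layer(weights: &[u8], acc: u64) -> u64 {
    acc + weights.iter().map(|&b| b as u64).sum::<u64>()
}

fn run_pipeline(num_layers: usize) -> u64 {
    // Depth-1 prefetch queue: the loader blocks once it is one layer ahead.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(1);
    // The loader needs no hints from compute: layer order is deterministic,
    // so it simply reads layers 0..N in sequence.
    let loader = thread::spawn(move || {
        for n in 0..num_layers {
            tx.send(load_layer(n)).unwrap();
        }
    });
    let mut acc = 0u64;
    for _ in 0..num_layers {
        // By the time we ask for layer N, it was loaded while N-1 computed.
        let weights = rx.recv().unwrap();
        acc = compute_layer(&weights, acc);
    }
    loader.join().unwrap();
    acc
}
```

The OS page cache cannot replicate this because it only sees opaque page faults, never the "layer N+1 follows layer N" schedule.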
For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one,
then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than
expert 7. The neuron cache here is basically a domain-specific replacement policy.

I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
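The frequency-aware replacement policy described above can be sketched like this: evict the least frequently fired expert rather than the least recently used one, so a hot expert is never pushed out by a burst of cold ones. This is a toy illustration, not the actual neuron-cache implementation.

```rust
use std::collections::HashMap;

// Toy expert cache keyed by expert id, with per-expert fire counts.
struct ExpertCache {
    capacity: usize,
    resident: HashMap<u32, Vec<u8>>, // expert id -> weights
    fire_counts: HashMap<u32, u64>,
}

impl ExpertCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, resident: HashMap::new(), fire_counts: HashMap::new() }
    }

    /// Returns true on a cache hit, false when the expert had to be streamed.
    fn access(&mut self, expert: u32, load: impl Fn(u32) -> Vec<u8>) -> bool {
        *self.fire_counts.entry(expert).or_insert(0) += 1;
        if self.resident.contains_key(&expert) {
            return true; // hit: no NVMe read needed
        }
        if self.resident.len() >= self.capacity {
            // Evict the resident expert with the lowest fire count —
            // unlike LRU, this knows expert 3 fires far more than expert 7.
            let coldest = *self
                .resident
                .keys()
                .min_by_key(|id| self.fire_counts.get(*id).copied().unwrap_or(0))
                .unwrap();
            self.resident.remove(&coldest);
        }
        self.resident.insert(expert, load(expert));
        false // miss: streamed from NVMe
    }
}
```

With capacity 2 and a routing pattern that hammers expert 3, a rarely-fired expert gets evicted while the hot one stays resident.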
Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem that each individual expert-layer is quite small (a few MiB each, give or take), but those reads are large enough for the NVMe.
 _   _
| | | |_   _ _ __  _   _ _ __ __ _
| |_| | | | | '_ \| | | | '__/ _` |
|  _  | |_| | |_) | |_| | | | (_| |
|_| |_|\__, | .__/ \__,_|_|  \__,_|
       |___/|_|
Run models too big for your Mac's memory
Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities, enabling models that exceed physical memory to run without crashing the system.
Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.
Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model: the OS will swap-thrash until the OOM killer intervenes.
Hypura solves this by understanding the model architecture.
The result: models that would crash your machine under naive mmap become runnable. Models that fit in memory run at full Metal GPU speed with zero overhead.
Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:
- GPU tier: tensors resident in Metal buffers, capped by the device's `recommendedMaxWorkingSetSize`.
- NVMe tier: tensors streamed with uncached reads (`F_NOCACHE` + `pread`), prefetched ahead of the forward pass.

Hypura selects the best inference mode automatically based on model size, architecture, and available memory: full-resident, expert-streaming, or dense-FFN-streaming.
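A toy sketch of the greedy half of such a placement pass (the README mentions an LP as well; this shows only a greedy fill): sort tensors by access frequency per byte, fill the GPU budget first, then RAM, and spill the rest to NVMe. Names, numbers, and the scoring formula here are illustrative assumptions, not Hypura's actual heuristic.

```rust
#[derive(Debug, PartialEq)]
enum Tier { Gpu, Ram, Nvme }

struct Tensor {
    name: &'static str,
    bytes: u64,
    accesses_per_token: f64, // norms fire every token; experts fire sparsely
}

fn place(tensors: &[Tensor], gpu_budget: u64, ram_budget: u64) -> Vec<(&'static str, Tier)> {
    let mut order: Vec<&Tensor> = tensors.iter().collect();
    // Hottest bytes first: accesses per token divided by size.
    order.sort_by(|a, b| {
        let ka = a.accesses_per_token / a.bytes as f64;
        let kb = b.accesses_per_token / b.bytes as f64;
        kb.partial_cmp(&ka).unwrap()
    });
    let (mut gpu_left, mut ram_left) = (gpu_budget, ram_budget);
    order
        .iter()
        .map(|t| {
            let tier = if t.bytes <= gpu_left {
                gpu_left -= t.bytes;
                Tier::Gpu
            } else if t.bytes <= ram_left {
                ram_left -= t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme // streamed at inference time
            };
            (t.name, tier)
        })
        .collect()
}
```

Small, every-token tensors (norms, attention) win the GPU budget; large, rarely-touched expert weights land on NVMe, which is exactly the split the benchmark table below reflects.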
Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile; no manual tuning needed.
All benchmarks on M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read.
| Model | Size | GPU | NVMe | Mode | Hypura | llama.cpp | Notes |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4 GB | 8.4 GB | – | full-resident | 21 tok/s | ~21 tok/s | Fits in GPU; no overhead |
| Mixtral 8x7B Q5_K_M | 30.9 GB | 1.1 GB | 29.8 GB | expert-streaming | 2.2 tok/s | OOM | All layers on Metal; 99.5% cache hit rate |
| Llama 3.3 70B Q4_K_M | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | 0.3 tok/s | OOM | All layers on Metal; dynamic 24-slot pool, 7-layer prefetch |
Key takeaway: For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token). Dense FFN-streaming extends this to non-MoE models like Llama 70B. Pool sizes and prefetch depth scale automatically with available memory.
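The arithmetic behind "usable interactive speeds" can be made concrete. With MoE sparsity, only the experts that actually fire are streamed, and a high cache hit rate shrinks that further. The numbers below are illustrative, not Hypura's measured figures:

```rust
// Bytes streamed from NVMe per generated token, under MoE sparsity plus a
// neuron cache. All inputs are illustrative.
fn streamed_bytes_per_token(
    expert_bytes: u64,   // size of one expert's weights
    layers: u64,
    experts_fired: u64,  // e.g. 2 of 8 experts per token for Mixtral
    cache_hit_rate: f64, // fraction of expert reads served from the cache
) -> f64 {
    (expert_bytes * layers * experts_fired) as f64 * (1.0 - cache_hit_rate)
}

// Upper bound on decode speed when NVMe bandwidth is the bottleneck.
fn tokens_per_second(bytes_per_token: f64, nvme_bytes_per_sec: f64) -> f64 {
    nvme_bytes_per_sec / bytes_per_token
}
```

A dense model must stream its whole FFN every token, so the same formula with `experts_fired` equal to the full expert count (and a far lower hit rate) explains why Llama 70B decodes an order of magnitude slower than Mixtral in the table above.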
Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake (for the vendored llama.cpp).
git clone --recurse-submodules https://github.com/hypura/hypura.git
cd hypura
cargo build --release
The binary is at target/release/hypura.
Homebrew tap coming soon.
# Profile your hardware (runs once, cached)
hypura profile
# Run inference on a GGUF model
hypura run ./model.gguf --prompt "Hello, world"
# Interactive chat
hypura run ./model.gguf --interactive
# Benchmark: Hypura scheduling vs naive baseline
hypura bench ./model.gguf
# Inspect model placement plan without loading
hypura inspect ./model.gguf
Start with --max-tokens 10 on untested models before scaling up.
Hypura exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama, including OpenClaw.
hypura serve ./model.gguf
# Hypura serving Mixtral 8x7B Instruct v0.1
# Endpoint: http://127.0.0.1:8080
# Ollama-compatible API: /api/generate, /api/chat, /api/tags
| Endpoint | Description |
|---|---|
| `GET /` | Health check |
| `GET /api/tags` | List loaded model |
| `GET /api/version` | Server version |
| `POST /api/show` | Model metadata |
| `POST /api/generate` | Text completion (streaming NDJSON or single response) |
| `POST /api/chat` | Chat completion (streaming NDJSON or single response) |
Point OpenClaw at Hypura by setting the Ollama base URL in ~/.openclaw/openclaw.json:
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:8080",
"api": "ollama"
}
}
}
}
Or via the CLI:
openclaw config set models.providers.ollama.baseUrl "http://127.0.0.1:8080"
Hypura speaks native Ollama protocol (/api/chat with NDJSON streaming), so no compatibility shims are needed.
hypura serve <MODEL> [OPTIONS]
Options:
--host <HOST> Host to bind to [default: 127.0.0.1]
--port <PORT> Port to bind to [default: 8080]
--context <N> Maximum context length [default: 4096]
Hypura is a Cargo workspace with two crates:
- `hypura`: Main binary and library. CLI in `src/main.rs`, all logic in `src/lib.rs` modules.
- `hypura-sys`: FFI bindings to llama.cpp (vendored at `vendor/llama.cpp/`, built via CMake).

| Module | Purpose |
|---|---|
| `scheduler/placement.rs` | LP + greedy tensor placement across GPU/RAM/NVMe tiers |
| `compute/inference.rs` | Inference engine: `generate_blocking`, `generate_with_nvme_scheduling`, server-oriented `load_model` / `generate_from_loaded` |
| `compute/nvme_backend.rs` | Custom GGML buffer type, pool-based expert/FFN streaming, neuron cache, eval callback |
| `server/routes.rs` | Axum HTTP handlers for Ollama-compatible API |
| `profiler/` | Hardware detection (CPU, GPU, memory bandwidth, NVMe throughput) |
| `cli/bench.rs` | A/B benchmark harness |
| `model/tensor_role.rs` | Tensor classification for placement scoring (norms, attention, MoE experts) |
No. Hypura only reads from your SSD during inference; it never writes to it.
SSD wear is caused by write cycles (program/erase cycles on NAND flash cells). Reads do not degrade flash cells. Hypura's entire NVMe I/O path uses read-only pread() calls with F_NOCACHE to stream tensor weights from the GGUF file into RAM/GPU memory pools, where all computation happens. The SSD is used as cold storage, not as working memory.
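The read-only I/O path boils down to positioned reads against the GGUF file. A minimal sketch, assuming only the standard library (the real path additionally sets `F_NOCACHE` via `fcntl` on macOS to bypass the page cache; that flag is omitted here for portability):

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;

// Read one tensor's byte range straight out of the weights file.
// read_exact_at is a positioned read (pread(2) under the hood), so no seek
// state is shared between threads, and nothing here can ever write.
fn read_tensor(file: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```

Because the file is opened read-only and only `pread`-style calls are issued, no program/erase cycles occur, which is why inference causes no flash wear.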
The only writes Hypura performs are negligible: benchmark result JSON files (KB), co-activation statistics (KB to ~/.hypura/), and the one-time hypura optimize command if you choose to run it. Normal inference generates zero SSD writes.
- `bench --baseline` is blocked when the model exceeds RAM minus 4 GB headroom. Use `--force` to override at your own risk.
- Start with `--max-tokens 10` on untested models.
- Test models go in `./test-models/` (not checked in).

MIT
I feel morally obligated to say I did not write the code in this repository myself. This project is an exploration of using LLMs to carry out tasks based on my direction. The majority of prompts I used to get here were derived using the Socratic method, genuine curiosity, and a hunch that NVMe-backed inference is underutilized despite NVMe being a (slow but) perfectly valid form of memory.
Still, couldn't one get a RAID 0 card with four drives to saturate an x16 link? That's already the max one could push through PCIe anyhow.
man 2 madvise
Joke aside (I do have one, though!), I don't think Optane is that much use (not to mention my unit is only 256 GiB). It is a useful legacy crutch if you have legacy software that is not designed to issue multiple reads/writes in parallel. If you do, it is really not faster than NVMe, especially the modern drives.
"overloading NVMe"? What is that about? First time I've heard anything about it.
> because putting a ton of stress on your NVMe during generation
Really shouldn't "stress your NVMe", something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations "hurt" the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.
There is no writing to SSDs on inference with this architecture.