The most interesting part of this idea for me is that it wasn't tried or implemented before, given how much sense it makes.

I haven't read the paper but of course DTree tricks work here as well

I wonder what our man @antirez will make of this

Does this translate into a similar reduction in compute?

What's the catch?

It is all about moving the bottleneck. During prompt processing everything can be calculated in parallel, while during token generation you create a single token at a time. For example, using an RTX 4000 Ada, I'm getting 2700 t/s for prompt processing, and 48 t/s for token generation using an 8B class model.

Their approach is essentially a form of speculative decoding, where multiple tokens are predicted at once and then verified. This lets tokens be generated at a speed that is closer to the prompt processing speed.

It seems to be special because their approach yields the exact same output distribution as the base model and it only takes a negligible amount of additional memory.

The main catch is that if your prompt processing speed is already bad, it will not help you all that much.

For example, the M-series Macs (up to M4) have a relatively high generation speed compared to their prompt processing speed. That means they will not benefit as much (if at all). With the M5 the prompt processing speed has increased 4x, so those can expect to see a good uplift.
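To make the "moving the bottleneck" point concrete, here is a rough back-of-the-envelope sketch. It assumes a verification step can never be faster than one ordinary decode step, nor faster than pushing the drafted tokens through the model at prompt-processing speed. The function name, `draft_len=8` and `accepted_per_step=5` are made up for illustration; only the 2700 t/s and 48 t/s figures come from the comment above.

```python
def effective_tps(tg_speed, pp_speed, draft_len, accepted_per_step):
    """Crude upper bound on tokens/s when draft_len tokens are verified per step.

    A step cannot finish faster than one ordinary decode step (1 / tg_speed,
    roughly the cost of streaming the weights once), nor faster than the time
    needed to process draft_len tokens at prompt-processing speed.
    The cost of producing the draft itself is ignored.
    """
    step_time = max(1.0 / tg_speed, draft_len / pp_speed)
    return accepted_per_step / step_time


# 2700 t/s and 48 t/s are the numbers quoted above; the second machine
# (100 t/s prompt processing) is a hypothetical slow-prefill device.
fast_prefill = effective_tps(48, 2700, draft_len=8, accepted_per_step=5)
slow_prefill = effective_tps(48, 100, draft_len=8, accepted_per_step=5)

print(f"fast prompt processing: {fast_prefill:.1f} t/s")
print(f"slow prompt processing: {slow_prefill:.1f} t/s")
```

Under these assumptions the machine with fast prompt processing gets the full benefit of every accepted token, while the slow-prefill machine barely improves on its 48 t/s baseline, which is the catch described above.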

> Does this translate into a similar reduction in compute?

No, quite the opposite actually. Like with speculative decoding, this model will compute more tokens and discard the invalid ones.

> What's the catch?

LLMs[1] are limited by memory bandwidth and not by compute[2]: because they process tokens one at a time, you spend more time loading the weights from VRAM into the GPU's compute units than waiting for compute to happen. Techniques like these make it possible to process multiple tokens in parallel instead of one by one, and so make better use of your graphics card's compute. They do so by predicting which tokens are likely to occur and then verifying that the guess was correct.

For instance if the previous token is “hello”.

A regular autoregressive LLM will compute:

“hello” => “! ”,

then “hello! ” => “how ”,

“hello! how ” => “are ”,

“hello! how are ” => “you”.

and finally “hello! how are you” => “?<end>”

One at a time, loading every weight from GPU memory into its compute units 5 times.

With speculative decoding (I'd say this one isn't strictly speculative decoding, but it's a variant of the same principle), you have something that guesses that the whole sentence is going to be “how are you today?”, so the LLM can compute

“hello” => “! ”,

“hello! ” => “how ”,

“hello! how ” => “are ”,

“hello! how are ” => “you”.

“hello! how are you” => “?<end>”

“hello! how are you today” => “?<end>”

In parallel. So each weight is loaded from VRAM only once instead of 5 times.

The last token will be discarded though, as the prefix “how are you today” doesn't match what has actually been generated. So in that particular example, you'd have gotten your 5 tokens 5 times faster than with pure autoregressive inference, but at the expense of a 6th token being computed and discarded immediately. So 5 times more token throughput, but a 20% compute cost increase per token.
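The "hello" walkthrough can be sketched as a toy greedy verifier. This is only an illustration of the general draft-then-verify idea, not Orthrus's actual algorithm: the "model" is a hypothetical lookup table, and the per-position predictions that a real model would produce in one forward pass are computed here in a plain loop.

```python
# Hypothetical toy "model": maps a prefix to the next token it would emit.
TARGET = {
    ("hello",): "! ",
    ("hello", "! "): "how ",
    ("hello", "! ", "how "): "are ",
    ("hello", "! ", "how ", "are "): "you",
    ("hello", "! ", "how ", "are ", "you"): "?<end>",
    ("hello", "! ", "how ", "are ", "you", "today"): "?<end>",
}


def next_token(prefix):
    return TARGET[tuple(prefix)]


def verify(prefix, draft):
    """Greedy verification: keep model tokens until the first wrong guess."""
    # One prediction per position: after the prefix, after prefix + 1 draft
    # token, and so on. A real model scores all of these in a single pass.
    predictions = [next_token(prefix + draft[:i]) for i in range(len(draft) + 1)]
    accepted = []
    for guess, actual in zip(draft, predictions):
        accepted.append(actual)      # the model's own token is always kept
        if guess != actual:          # wrong guess: later positions are wasted
            break
    else:
        accepted.append(predictions[-1])  # every guess was right: bonus token
    return accepted, len(predictions)


accepted, positions = verify(["hello"], ["! ", "how ", "are ", "you", "today"])
print(accepted, positions)
print(f"compute per kept token: {positions / len(accepted):.2f}x")
```

Five tokens are kept out of six computed positions, which is where the 20% extra compute per token comes from.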

[1]: autoregressive LLMs, that is. Which are the ones everybody uses because they are the most performant.

[2]: at least when run at low batch size, on your own computer for your personal use. In a datacenter, with many concurrent users, GPUs are actually compute-bound.
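A rough way to see the memory-bound regime from footnote [2]: if every decode step has to stream all weights from VRAM once, tokens per second cannot exceed bandwidth divided by model size, and verifying several positions per step multiplies that bound until compute becomes the limit. The numbers below are assumptions for illustration, not from the thread: 360 GB/s of memory bandwidth and an 8B model quantized to about 4.5 bits per weight.

```python
def decode_tps_bound(bandwidth_bytes_per_s, weight_bytes, tokens_per_step=1):
    """Upper bound on tokens/s if each step must stream all weights once.

    Ignores KV-cache traffic and the compute ceiling, so it is only
    meaningful while tokens_per_step is small.
    """
    steps_per_s = bandwidth_bytes_per_s / weight_bytes
    return steps_per_s * tokens_per_step


bandwidth = 360e9           # ASSUMPTION: bytes/s of VRAM bandwidth
weights = 8e9 * 4.5 / 8     # ASSUMPTION: 8B parameters at ~4.5 bits each

single = decode_tps_bound(bandwidth, weights)
parallel = decode_tps_bound(bandwidth, weights, tokens_per_step=5)

print(f"one token per step:   <= {single:.0f} t/s")
print(f"five tokens per step: <= {parallel:.0f} t/s")
```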

I don't understand. This distills a diffusion transformer out of Qwen3. And while the provably identical is nice, a full diffusion transformer would be a lot faster still.

A full diffusion transformer would need more forward passes (thus being slower) or produce worse output (because it can't properly account for dependencies between tokens when generating them independently in parallel), or both. Keeping the output identical to the autoregressive baseline ensures the speedup doesn't come at the cost of quality degradation.
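The dependency problem can be shown with a two-token toy distribution (the city names are a made-up example). Sampling each position independently from its own marginal, as a naive parallel decoder would in a single pass, puts probability on sequences the joint distribution never produces.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint distribution over two-token sequences.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginal(joint, position):
    """Probability of each token at one position, ignoring the other."""
    out = defaultdict(float)
    for tokens, prob in joint.items():
        out[tokens[position]] += prob
    return dict(out)


first, second = marginal(joint, 0), marginal(joint, 1)

# Independent parallel sampling: product of the per-position marginals.
independent = {(a, b): first[a] * second[b] for a, b in product(first, second)}
invalid_mass = sum(p for pair, p in independent.items() if pair not in joint)

for pair, p in independent.items():
    print(pair, p)
print("probability of a sequence the joint never produces:", invalid_mass)
```

Half of the probability mass lands on "New Angeles" and "Los York". Fixing that takes either extra refinement passes or verification against the autoregressive distribution.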

If someone can make this work with GGUF and Quantized Qwen 3.6 or Deepseek 4 it would greatly help running local models.

Multi-token prediction is available now, I'm still getting it set up but it sounds like it'll be 1.5x or 2x on the bigger models.

Minor nit re [2]: for agentic workloads that are actually worth money - i.e., claude code and similar, things are either prefill-bound (which this does not help) or, more importantly, tps/user-bound at 150k+ context windows: you want your big magic model to emit 200 tps/user. This is why Nvidia bought Groq (now LPU) and what Cerebras is trying to do, etc. So for the stuff that makes money in the field, GPUs are not really compute-bound once context lengths are large; they are still memory-transfer-bound (maybe KV-cache transfer, maybe HBM->SRAM on-chip, etc.).
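A quick estimate shows why long contexts stay memory-transfer-bound. The layer count, KV head count and head dimension below are assumptions meant to resemble a Qwen3-8B-class model with a bf16 cache; check them against the model's actual config before relying on the numbers.

```python
def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence (keys and values, all layers)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_len


# ASSUMPTIONS: 36 layers, 8 KV heads, head_dim 128, 2 bytes per value (bf16).
total = kv_cache_bytes(150_000, num_layers=36, num_kv_heads=8, head_dim=128)

# Each generated token attends over the whole cache, so at 200 tokens/s
# for a single user the cache is read 200 times per second.
traffic = total * 200

print(f"KV cache at 150k tokens:  {total / 1e9:.1f} GB")
print(f"KV reads at 200 tps/user: {traffic / 1e12:.1f} TB/s")
```

Under these assumptions a single user at 150k context needs several terabytes per second of cache reads, which is memory traffic rather than arithmetic.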

I've tried MTP, and that got me about 1.5x on average with a very spec friendly benchmark.

I didn't run the full benchmark with the demo code, just picked a single prompt from it. The prompt is about 1300 tokens, the response is about 3200 tokens.

Baseline: 44.8 t/s
With Orthrus: 164.6 t/s

Note: Don't use the `use_diffusion_mode=` config flag in their example to collect a baseline. Something about how the fallback to "normal" makes it grind to a crawl.
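For reference, the speedup implied by those two numbers can be worked out directly. This only does arithmetic on the figures quoted above (44.8 t/s, 164.6 t/s, a response of about 3200 tokens) and ignores prompt processing time.

```python
baseline_tps = 44.8      # reported baseline generation speed
orthrus_tps = 164.6      # reported speed with Orthrus
response_tokens = 3200   # approximate response length from the comment

speedup = orthrus_tps / baseline_tps
baseline_seconds = response_tokens / baseline_tps
orthrus_seconds = response_tokens / orthrus_tps

print(f"speedup: {speedup:.2f}x")
print(f"baseline: {baseline_seconds:.1f} s, Orthrus: {orthrus_seconds:.1f} s")
```

That is roughly a 3.7x speedup on this single prompt, so a single-prompt run like this is not directly comparable to a benchmark-wide average.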

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.

Orthrus Architecture

https://github.com/user-attachments/assets/2a0b021c-e232-4ac6-bf5c-c582c422505e

Model Zoo

All models use a Qwen3 backbone and guarantee strictly lossless generation.

Model	Base Model	HuggingFace	Avg. Speedup
Orthrus-Qwen3-1.7B	Qwen3-1.7B	🤗 HuggingFace	4.25×
Orthrus-Qwen3-4B	Qwen3-4.0B	🤗 HuggingFace	5.20×
Orthrus-Qwen3-8B	Qwen3-8.0B	🤗 HuggingFace	5.36×

Installation

uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation # or: pip install "flash-attn-4[cu13]" if your device supports it

We recommend uv for fast dependency resolution.

Quickstart

⚡ Try instantly: Run Orthrus directly in Colab:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
 

model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16, device_map="cuda",
    attn_implementation="flash_attention_2",  # options: sdpa | eager | flash_attention_4
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")
 
prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False).input_ids

output_ids = model.generate(
    input_ids=input_ids.to(model.device), 
    max_new_tokens=2048,
    use_diffusion_mode=True, 
    streamer=TextStreamer(tokenizer, skip_prompt=True) # enable streaming generation
)

Coming soon: native integration with vLLM and SGLang. Stay tuned!

Key Advantages

Significant Inference Acceleration: Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\times$ speedup on generation tasks.
Strictly Lossless Generation: Employs an exact intra-model consensus mechanism to guarantee that the output matches the original base model's exact predictive distribution.
Zero Redundant Memory Overhead: Both the autoregressive and diffusion views attend to the exact same high-fidelity Key-Value (KV) cache natively, resulting in only an $O(1)$ memory cache overhead.
Parameter Efficient: Parallel generation capabilities are injected by fine-tuning only 16% of the total model parameters while keeping the base LLM strictly frozen.
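The README does not spell out how its consensus mechanism works, so the following is not Orthrus's method. It is a sketch of the standard speculative-sampling acceptance rule, shown only to illustrate how a draft-and-verify scheme can reproduce the base model's distribution exactly: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q. The distributions `p` and `q` are made-up toy values.

```python
from fractions import Fraction as F


def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one accept/resample step.

    p: target (base model) distribution, q: draft distribution.
    """
    # P(draft x and accept it) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accept = {x: min(p[x], q[x]) for x in p}
    reject_mass = 1 - sum(accept.values())
    # On rejection, resample from the normalized positive part of p - q.
    residual = {x: max(F(0), p[x] - q[x]) for x in p}
    z = sum(residual.values())
    return {x: accept[x] + (reject_mass * residual[x] / z if z else 0) for x in p}


# Toy distributions over three tokens (illustrative values only).
p = {"a": F(1, 2), "b": F(3, 10), "c": F(1, 5)}
q = {"a": F(1, 5), "b": F(1, 2), "c": F(3, 10)}

out = speculative_output_distribution(p, q)
print(out == p)
```

Exact fractions are used so the equality check is not blurred by floating-point rounding.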

Performance Comparison: Orthrus vs. Speculative Decoding

Orthrus outperforms speculative decoding methods such as EAGLE-3 and DFlash. By natively sharing the exact same KV cache across dual views, Orthrus avoids the redundant memory overhead of draft models, resulting in significantly higher token acceptance rates and faster inference times, especially as context length scales. Orthrus maintains consistently high end-to-end throughput even at 40K context lengths, whereas DFlash degrades rapidly.

Left: Average verified tokens per forward pass compared to EAGLE-3 and DFlash. Right: End-to-end throughput across scaling context lengths.
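As a rough guide to reading "average verified tokens per forward pass", the usual simplified model from the speculative decoding literature assumes each drafted token is accepted independently with probability alpha. With a draft of length k, the expected number of tokens produced per pass is then (1 - alpha^(k+1)) / (1 - alpha). The values of `alpha` and `draft_len` below are illustrative and not taken from the Orthrus results.

```python
def expected_tokens_per_pass(alpha, draft_len):
    """Expected tokens emitted per verification pass.

    Assumes independent per-token acceptance with probability alpha; the
    pass always yields one model token plus each draft token accepted
    before the first rejection.
    """
    if alpha == 1.0:
        return draft_len + 1
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_pass(alpha, draft_len=7)
    print(f"alpha={alpha}: {tokens:.2f} tokens per pass")
```

Higher acceptance rates matter much more than longer drafts: at alpha = 0.5 the expectation saturates near 2 tokens per pass no matter how long the draft is.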

Comparison with State-of-the-Art Diffusion Models

While recent diffusion language models (dLLMs) offer parallel decoding, they often suffer from significant conditional drift and severe accuracy degradation on complex reasoning tasks. Orthrus resolves this by decoupling parallel generation from sequential constraints, establishing a new state-of-the-art for parallel generation fidelity.

Throughput vs. Accuracy on MATH-500. Orthrus delivers a ~6x speedup over the Qwen3-8B baseline with strictly lossless performance, whereas adaptations like Fast-dLLM-v2 suffer significant accuracy drops.

Further Support

MLX (Apple Silicon)

Orthrus supports native inference on Apple Silicon via MLX. Tested with mlx==0.31.2 and mlx-lm==0.31.3.

Usage:

from src.model_mlx import load_model_and_tokenizer, mlx_generate
 
repo_id = "chiennv/Orthrus-Qwen3-1.7B"
model, tokenizer = load_model_and_tokenizer(repo_id)
 
prompt_tokens = tokenizer.encode("If a rectangle has length 12 and width 7, what is its area?")
 
for token in mlx_generate(model, prompt_tokens, tokenizer.eos_token_id, max_tokens=128):
    print(tokenizer.decode([token]), end="", flush=True)

Citation

If you find this model or architecture useful in your work, please cite our paper:

@misc{vannguyen2026orthrusmemoryefficientparalleltoken,
      title={Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion}, 
      author={Chien Van Nguyen and Chaitra Hegde and Van Cuong Pham and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
      year={2026},
      eprint={2605.12825},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.12825}, 
}

> multiple tokens are predicted at once and then verified

Reminds me a little of a carry lookahead adder.

More like speculative prefetch I'd think

> i.e., claude code and similar, things are either prefill-bound

When accounting for prefix caching, this greatly accelerates each turn. Barring large file reads, prefill still isn't the bottleneck vs. decoding reasoning tokens. Script-writing too.

This is especially true during exploration phases, when traversing directory trees and grepping files: you're talking about a few hundred tokens/turn.

Fantastic results. Well done. So this is built into the way the model works, if I'm understanding it correctly.

I was wondering what would be involved in getting it to work with GGUF files, rather than safetensors files...

Just to get it into a GGUF file would be fairly trivial. But using that GGUF file would need a bunch of additional things. One would need to create a new architecture derived from Qwen3, and then probably adapt the speculative decoding functionality.

At the moment not even MTP is merged into llama.cpp, so I wouldn't quite hold my breath for it.

I thought that might be the case. I naively wondered. I'll see if I can understand the paper :-)

Hope the paper gets lots of references and the technique gets a lot of use to save power and time.

There have been several potentially big changes for LLM inference efficiency over the last few months: Attention Sequencing (I think it's called?), Turbo Quant, and this one.

Interesting times.

MTP merged today, a couple of hours after your post by the looks of things.

By the looks of it, it will take a couple more follow-up PRs to clean things up a bit and get the most performance from MTP. I hope that by that point it will be easier to add more spec decoding types.

In the meantime I've benchmarked Orthrus some more and got some quite promising results. So I'd be glad if my prediction that it may take some time until it lands in llama.cpp turns out to be wrong.

Hacker Times

Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution
