I haven't read the paper but of course DTree tricks work here as well
Their approach is essentially a form of speculative decoding: multiple tokens are predicted at once and then verified, so tokens are generated at a speed closer to the prompt processing speed.
It seems to be special because it yields the exact same output distribution as the base model and takes only a negligible amount of additional memory.
The main catch is that if your prompt processing speed is already bad, it will not help you all that much.
For example, the M-series Macs (up to M4) have a relatively high generation speed compared to their prompt processing speed. That means they will not benefit as much (if at all). With the M5 the prompt processing speed has increased 4x, so those can expect to see a good uplift.
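A rough back-of-envelope sketch of why prompt processing speed caps the gain. It assumes (my assumption, not something from the paper) that one verification pass over several drafted tokens takes whichever is longer: a single memory-bound decode step, or that many tokens of compute at prompt processing speed. The function name and all numbers are hypothetical, not measurements.

```python
def estimated_speedup(decode_tps: float, prefill_tps: float,
                      drafted: int, accepted: float) -> float:
    """Rough speedup estimate for draft-and-verify decoding.

    Assumption: a pass over `drafted` tokens costs the longer of one
    ordinary decode step and `drafted` tokens at prefill speed.
    `accepted` is the average number of tokens kept per pass.
    """
    step_time = 1.0 / decode_tps                      # one ordinary decode step
    verify_time = max(step_time, drafted / prefill_tps)
    return (accepted / verify_time) / decode_tps


# Hypothetical machine with fast prompt processing: verification is "free".
fast_prefill = estimated_speedup(decode_tps=50, prefill_tps=5000, drafted=8, accepted=5)

# Hypothetical machine with slow prompt processing: verification dominates.
slow_prefill = estimated_speedup(decode_tps=50, prefill_tps=200, drafted=8, accepted=5)

print(f"fast prefill: {fast_prefill:.1f}x, slow prefill: {slow_prefill:.1f}x")
```

Under this model the same acceptance rate gives half the speedup once the verification pass becomes compute-bound, which is the situation described for the older Macs.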
No, quite the opposite actually. Like with speculative decoding this model will compute more tokens and discard the invalid ones.
> What's the catch?
LLMs[1] are limited by memory bandwidth rather than by compute[2]: because they process tokens one at a time, you spend more time moving the weights from VRAM to the GPU's compute units than waiting for the compute itself. Techniques like this one make it possible to process multiple tokens in parallel instead of one by one, and so make better use of your graphics card's compute. They do so by predicting which tokens are likely to occur and then verifying that the guess was correct.
For instance, if the previous token is “hello”, a regular autoregressive LLM will compute:
“hello” => “! ”,
then “hello! ” => “how ”,
“hello! how ” => “are ”,
“hello! how are ” => “you”.
and finally “hello! how are you” => “?<end>”
One at a time, loading every weight 5 times from GPU memory into its compute units.
With speculative decoding (I'd say this one isn't strictly speculative decoding, but it's a variant of the same principle), you have something that guesses that the whole sentence is going to be “how are you today?”, so the LLM can generate
“hello” => “! ”,
“hello! ” => “how ”,
“hello! how ” => “are ”,
“hello! how are ” => “you”.
“hello! how are you” => “?<end>”
“hello! how are you today” => “?<end>”
In parallel. So each weight would have been loaded only once from VRAM instead of 5 times.
The last token will be discarded though, as the prefix “how are you today” doesn't match what has actually been generated. So in that particular example, you'd have gotten your 5 tokens 5 times faster than with pure autoregressive inference, but at the expense of a 6th token being generated and discarded immediately. So 5 times more token throughput, at the cost of a 20% increase in compute per token.
[1]: autoregressive LLMs, that is. Which are the ones everybody uses because they are the most performant.
[2]: at least when run at low batch size, on your own computer for your personal use. In a datacenter, with many concurrent users, GPUs are actually compute-bound.
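The accept/reject logic of the “hello” example can be sketched in a few lines of Python. This is a toy illustration of generic greedy draft-and-verify, not the Orthrus implementation: `toy_next_token` is a hypothetical lookup table standing in for the LLM, and `verify_draft` is a name made up for this sketch. On real hardware all positions are evaluated in one batched forward pass; the list comprehension here only models which predictions get kept.

```python
from typing import Callable, List, Sequence

Token = str
NextToken = Callable[[Sequence[Token]], Token]

# Hypothetical "model": it always wants to produce this exact sequence.
TARGET = ["hello", "! ", "how ", "are ", "you", "?<end>"]


def toy_next_token(prefix: Sequence[Token]) -> Token:
    """Stand-in for one LLM step: next token for a given prefix."""
    n = len(prefix)
    if n < len(TARGET) and list(prefix) == TARGET[:n]:
        return TARGET[n]
    return "?<end>"  # arbitrary output for a prefix the model never produced


def verify_draft(next_token: NextToken, prefix: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Return the tokens kept from one verification pass (greedy decoding)."""
    # One prediction per position, each conditioned on the guessed prefix.
    predictions = [next_token(prefix + draft[:i]) for i in range(len(draft) + 1)]
    accepted: List[Token] = []
    for i, predicted in enumerate(predictions):
        accepted.append(predicted)  # computed on a verified prefix, so valid
        # Once the guess diverges, later predictions used a wrong prefix.
        if i == len(draft) or draft[i] != predicted:
            break
    return accepted


draft = ["! ", "how ", "are ", "you", " today"]  # the guess "how are you today"
kept = verify_draft(toy_next_token, ["hello"], draft)
print(kept, f"({len(draft) + 1} predictions computed, {len(kept)} kept)")
```

Even a completely wrong guess still yields one valid token per pass, which is why the output matches plain autoregressive decoding. With sampling instead of greedy decoding, the comparison step has to be replaced by a rejection-sampling rule to keep the output distribution unchanged.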
Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.
https://github.com/user-attachments/assets/2a0b021c-e232-4ac6-bf5c-c582c422505e
All models use a Qwen3 backbone and guarantee strictly lossless generation.
| Model | Base Model | HuggingFace | Avg. Speedup |
|---|---|---|---|
| Orthrus-Qwen3-1.7B | Qwen3-1.7B | 🤗 HuggingFace | 4.25× |
| Orthrus-Qwen3-4B | Qwen3-4.0B | 🤗 HuggingFace | 5.20× |
| Orthrus-Qwen3-8B | Qwen3-8.0B | 🤗 HuggingFace | 5.36× |
uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation # or: pip install "flash-attn-4[cu13]" if your device supports it
We recommend `uv` for fast dependency resolution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16, device_map="cuda",
    attn_implementation="flash_attention_2",  # options: sdpa | eager | flash_attention_4
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False).input_ids

output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=2048,
    use_diffusion_mode=True,
    streamer=TextStreamer(tokenizer, skip_prompt=True)  # enable streaming generation
)
Coming soon: native integration with vLLM and SGLang. Stay tuned!
Orthrus outperforms speculative decoding methods such as EAGLE-3 and DFlash. By natively sharing the exact same KV cache across dual views, Orthrus avoids the redundant memory overhead of draft models, resulting in significantly higher token acceptance rates and faster inference times, especially as context length scales. Orthrus maintains consistently high end-to-end throughput even at 40K context length, whereas DFlash degrades rapidly.
Left: Average verified tokens per forward pass compared to EAGLE-3 and DFlash. Right: End-to-end throughput across scaling context lengths.
While recent diffusion language models (dLLMs) offer parallel decoding, they often suffer from significant conditional drift and severe accuracy degradation on complex reasoning tasks. Orthrus resolves this by decoupling parallel generation from sequential constraints, establishing a new state-of-the-art for parallel generation fidelity.
Throughput vs. Accuracy on MATH-500. Orthrus delivers a ~6x speedup over the Qwen3-8B baseline with strictly lossless performance, whereas adaptations like Fast-dLLM-v2 suffer significant accuracy drops.
Orthrus supports native inference on Apple Silicon via MLX. Tested with mlx==0.31.2 and mlx-lm==0.31.3.
Usage:
from src.model_mlx import load_model_and_tokenizer, mlx_generate

repo_id = "chiennv/Orthrus-Qwen3-1.7B"
model, tokenizer = load_model_and_tokenizer(repo_id)
prompt_tokens = tokenizer.encode("If a rectangle has length 12 and width 7, what is its area?")

for token in mlx_generate(model, prompt_tokens, tokenizer.eos_token_id, max_tokens=128):
    print(tokenizer.decode([token]), end="", flush=True)
If you find this model or architecture useful in your work, please cite our paper:
@misc{vannguyen2026orthrusmemoryefficientparalleltoken,
title={Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion},
author={Chien Van Nguyen and Chaitra Hegde and Van Cuong Pham and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
year={2026},
eprint={2605.12825},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.12825},
}
I didn't run the full benchmark with the demo code, just picked a single prompt from it. The prompt is about 1,300 tokens, the response is about 3,200 tokens.
Baseline: 44.8 t/s
With Orthrus: 164.6 t/s
Note: Don't use the `use_diffusion_mode=` config flag in their example to collect a baseline. Something about how the fallback to "normal" makes it grind to a crawl.
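For anyone who wants to reproduce this kind of number, here is a minimal timing sketch. `tokens_per_second` and `speedup` are hypothetical helpers, not part of the Orthrus repo; `generate` is assumed to be any zero-argument callable that wraps a `model.generate(...)` call and returns the full sequence of token ids (prompt plus response). The `clock` parameter only exists so the helper can be checked without a GPU.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]], prompt_len: int,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Time one generation call and return newly generated tokens per second."""
    start = clock()
    output_ids = generate()
    elapsed = clock() - start
    new_tokens = len(output_ids) - prompt_len  # do not count the prompt
    return new_tokens / elapsed


def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of two throughput measurements."""
    return new_tps / baseline_tps


# The two figures reported above work out to roughly a 3.7x speedup.
print(f"{speedup(44.8, 164.6):.2f}x")
```

A single prompt gives a noisy estimate; averaging over several prompts and discarding a warm-up run would be the more careful setup.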
Reminds me a little of a carry lookahead adder.
I was wondering what would be involved in getting it to work with GGUF files, rather than safetensor files...
When accounting for prefix caching, this greatly accelerates each turn. Barring large file reads, prefill still isn't the bottleneck vs. decoding reasoning tokens. Script-writing too.
This is especially true during exploration phases: when traversing directory trees and grepping files, you're talking about a few hundred tokens/turn.
At the moment not even MTP is merged into llama.cpp, so I wouldn't quite hold my breath for it.
Hope the paper gets lots of references and the technique gets a lot of use to save power and time.
There have been several potentially big changes for LLM inference efficiency over the last few months: Attention Sequencing (I think it's called..?), Turbo Quant, and this one.
Interesting times.
In the meantime I've benchmarked Orthrus some more and got some quite promising results. So I'd be glad if my prediction that it may take some time until it lands in llama.cpp turns out to be wrong.