Introspective Diffusion Language Models

If I’m reading this right, this is pretty wild. They turned a Qwen autoregressor into a diffuser by using a bunch of really clever techniques, and they vastly outperform any “native diffuser,” actually being competitive with the base model they were trained from. The obvious upside here is the massive speedup in generation.

And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it “compare” its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).

I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.

Is anyone here experimenting seriously with Diffusion for text generation? I’d love to learn about your experiences!

I'm no expert (just a monkey... ;) ), but isn't Diffusion supposed to generate ALL of the output at once? From their diagram, it looks like their I-LDM model seems to use previously generated context to generate the next tokens (or blocks).

Does this mean I should switch to sglang? How hard is it to add the capability for these type of models to vLLM? Or does it already handle them?

> 2025-04-12: Initial code release with training and inference support.

> 2025-04-12: Released I-DLM-8B, I-DLM-32B, and I-DLM-8B-LoRA on HuggingFace.

Is this old already? Not saying that's a bad thing, since it seems very sophisticated. Just curious if there's an update

3倍向上したとこのとですが、ボトルネックはMemory BandwidthからComputeに移行したの？それともMemory Bandwidthが支配的ですか？

Can diffusion models have reasoning steps where they generate a block, introspect and then generate another until the output is satisfactory?

So can you just use this and have a faster Qwen32b?

https://huggingface.co/yifanyu/I-DLM-32B/tree/main

Is anyone here experimenting seriously with Diffusion for text generation? I’d love to learn about your experiences!

It's being explored right now for speculative decoding in the local-LLM space, which I think is quite interesting as a use-case

https://www.emergentmind.com/topics/dflash-block-diffusion-f...

https://www.inceptionlabs.ai/

This startup seems to have been at it a while.

From our look into it - amazing speed, but challenges remain around time-to-first-token user experience and overall answer quality.

Can absolutely see this working if we can get the speed and accuracy up to that “good enough” position for cheaper models - or non-user facing async work.

One other question I’ve had is wondering if it’s possible to actually set a huge amount of text to diffuse as the output - using a larger body to mechanically force greater levels of reasoning. I’m sure there’s some incredibly interesting research taking place in the big labs on this.

I have. It requires a distinct intuition compared to a normal language model. Very well suited to certain problems.

It's being explored right now for speculative decoding in the local-LLM space, which I think is quite interesting as a use-case

https://www.emergentmind.com/topics/dflash-block-diffusion-f...

https://www.inceptionlabs.ai/

This startup seems to have been at it a while.

From our look into it - amazing speed, but challenges remain around time-to-first-token user experience and overall answer quality.

Can absolutely see this working if we can get the speed and accuracy up to that “good enough” position for cheaper models - or non-user facing async work.

The overall speed rather than TTFT might start to be more relevant as the caller moves from being a human to another model.

However quality is really important. I tried that site and clicked one of their examples, "create a javascript animation". Fast response, but while it starts like this

``` Below is a self‑contained HTML + CSS + JavaScript example that creates a simple, smooth animation: a colorful ball bounces around the browser window while leaving a fading trail behind it.

<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>JavaScript Bounce Animation</title> <style> body, html { margin: 0; padding: 0;

```

the answer then degrades to

``` radius: BALL_RADIUS, color: BALL_COLOR, traivD O] // array of previous {x,y} positions }; ```

Then more things start creeping in

``` // 3⃣ Bounce off walls if (ball.G 0 ball.radius < 0 || ball.x + ball.radius > _7{nas.width) { ball.vx *= -1; ibSl.x = Math.max(ball.radius, Math.min(ball.x, canvbbF4idth - ball.radius)); } if

```

and the more it goes on the worse it gets

``` Ho7 J3 Works 0 Atep | Description | ```

and

``` • prwrZ8}E6on 5 jdF wVuJg Ar touc> 2ysteners ,2 Ppawn \?) balls w>SFu the 8b$] cliM#]9 ```

This is for the demo on the front page, so I expect this is a pretty good outcome compared to what else you might ask.

I have. It requires a distinct intuition compared to a normal language model. Very well suited to certain problems.

Can you tell us more?

I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.

Can you tell us more?

I think your excitement is justified. The paper is claiming a serious bridge between AR quality and parallel decoding, and the lossless LoRA-assisted mode is the wildest part.

I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?

Eh. There is nothing diffusion about this. Nothing to do with denoising. This setup is still purely causal, making it quite a dishonest framing IMO. There is no more introspection here than what happens in MTP + SD setups.

Let me explain what is going on here. This is basically a form of multi-token prediction. And speculative decoding in inference. See my earlier post[1] to understand what that is. TL;DR, in multi-token prediction you train separate LM heads to predict the next as well as next to next token as well as... Upto chosen next kth token. Training multiple LM heads is expensive and can be unnecessary, so what people typically do is have a common base for all the k heads, explained further in [1]. These guys do another variant.

Here is what they do mechanically, given a sequence p consisting of five tokens PE([p1, p2, p3, p4, p5]). Where PE(.) adds relative position info to each token.

1. Create an augmented sequence PE([p1 MASK MASK MASK MASK]). Do a training pass on that, with the ground truth sequence p1..5. Here it is trained to, for example, to predict p3 given p1+pos=-2 MASK+pos=-1 MASK+pos=0, loosely notating.

2. Then separately[2], train it as usual on PE([p1 p2 p3 p4 p5]).

Step (1) teaches it to do multi-token prediction, essentially the single LM head will (very very loosely speaking) condition on the position `k` of the special MASK token and "route" it to the "implicit" k'th LM head.

Step (2) teaches it to be a usual LLM and predict the next token. No MASK tokens involved.

So far, you have trained a multi-token predictor.

Now during inference

You use this for speculative decoding. You generate 5 tokens ahead at once with MASK tokens. And then you run that sequence through the LLM again. This has the same benefits as usual speculative decoding, namely that you can do matrix-matrix multiplication as opposed to matrix-vector. The former is more memory-bandwidth efficient due to higher arithmetic intensity.

here is an example,

query = ["what", "is", "2+2"]) prompt = PE([...query, MASK*5]) you run output = LLM(prompt). Say output is ["what", "is", "2+2", "it", "is", "4"]. Note that the NN is trained to predict the kth next token when faced with positionally encoded MASK tokens. So you get all 5 in one go. To be precise, it learns to predict "4" given ["what", "is", "2+2", MASK, MASK]. Since it does not need the "it" and "is" explicitly, you can do it in parallel with generating the "it" and the "is". "is" is predicted given ["what", "is", "2+2", MASK], for example, and that also doesn't depend on the explicit "it" being there, and thus can also be done in parallel with generating "it", which is just normal generating the next token given the query. And then you use this as a draft in your speculative decoding setup.

Their claim is that using a multi-token predictor this way as a draft model works really well. To be clear, this is still causal, the reason diffusion models have hype is because they are capable of global refinement. This is not. In the same thread as [1], I explain how increasing the number of MASK tokens, i.e increasing `k`, i.e the number of tokens you predict at once in your multi-token prediction setup quickly leads to poor quality. This paper agrees with that. They try out k=2,3,4,8. They see a drop in quality at 8 itself. So finally, this is 4-token-prediction with self-speculative decoding(sans LayerSkip or such), removing seemingly no existing limitation of such setups. It is definitely an interesting way to train MTP though.

[1] https://news.ycombinator.com/item?id=45221692

[2] Note that it is computationally a single forward pass. Attention masks help you fuse steps 1 and 2 into a single operation. However, you still have 2 separate loss values.

Block auto regressive generation can give you big speedups.

Consider that outputting two tokens at a time will be a (2-epsilon)x speedup over running one token at a time. As your block size increases, you quickly get to fast enough that it doesn't matter sooooo much whether you're doing blocks or actual all-at-once generation. What matters, then, is there quality trade-off for moving to block-mode output. And here it sounds like they've minimized that trade-off.

The overall speed rather than TTFT might start to be more relevant as the caller moves from being a human to another model.

However quality is really important. I tried that site and clicked one of their examples, "create a javascript animation". Fast response, but while it starts like this

``` Below is a self‑contained HTML + CSS + JavaScript example that creates a simple, smooth animation: a colorful ball bounces around the browser window while leaving a fading trail behind it.

<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>JavaScript Bounce Animation</title> <style> body, html { margin: 0; padding: 0;

```

the answer then degrades to

``` radius: BALL_RADIUS, color: BALL_COLOR, traivD O] // array of previous {x,y} positions }; ```

Then more things start creeping in

``` // 3⃣ Bounce off walls if (ball.G 0 ball.radius < 0 || ball.x + ball.radius > _7{nas.width) { ball.vx *= -1; ibSl.x = Math.max(ball.radius, Math.min(ball.x, canvbbF4idth - ball.radius)); } if

```

and the more it goes on the worse it gets

``` Ho7 J3 Works 0 Atep | Description | ```

and

``` • prwrZ8}E6on 5 jdF wVuJg Ar touc> 2ysteners ,2 Ppawn \?) balls w>SFu the 8b$] cliM#]9 ```

This is for the demo on the front page, so I expect this is a pretty good outcome compared to what else you might ask.

Weird; I clicked through out of curiosity and didn't get any corruption of the sort in the end result.

I also asked it some technical details about how diffusion LLMs could work and it provided grammatically-correct plausible answers in a very short time (I don't know the tech to say if it's correct or not).

Weird; I clicked through out of curiosity and didn't get any corruption of the sort in the end result.

Does this mean I should switch to sglang? How hard is it to add the capability for these type of models to vLLM? Or does it already handle them?

> 2025-04-12: Initial code release with training and inference support.

> 2025-04-12: Released I-DLM-8B, I-DLM-32B, and I-DLM-8B-LoRA on HuggingFace.

Is this old already? Not saying that's a bad thing, since it seems very sophisticated. Just curious if there's an update

It's clearly a typo on the year, April 12 was two days ago, a quick check in HuggingFace shows that they were uploaded 5 days ago.

3倍向上したとこのとですが、ボトルネックはMemory BandwidthからComputeに移行したの？それともMemory Bandwidthが支配的ですか？

So can you just use this and have a faster Qwen32b?

https://huggingface.co/yifanyu/I-DLM-32B/tree/main

This translates to

> I understand it improved by 3x, but has the bottleneck shifted from Memory Bandwidth to Compute? Or is Memory Bandwidth still dominant?

But why did you post your comment in Japanese? We have so many good options for automated translation nowadays!

でも、なぜ日本語でコメントを投稿したんですか？最近は自動翻訳の良い選択肢がたくさんあるのに！

I'm not in on the joke, can someone ELI5

Can diffusion models have reasoning steps where they generate a block, introspect and then generate another until the output is satisfactory?

Well, you can take the output of a first pass and pass it back through the model like AR “reasoning” models do at inference time.

I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?

Because the nature of transformers is that running a bunch of pregenerated tokens through them is a parallel operation, not autoregressive. That's how it works at training time, but speculative decoding uses it at inference time. So if you just want to check whether a set of known tokens is "likely" given the base model, you can run them all through and get probability distributions, no need to sample.

It's the same reason there's a difference in speed between "prompt processing" and "generation". The former is just taking the pre-generated prompt and building the KV cache, which is parallel, not autoregressive and therefore way faster.

I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say a draft model generates 5 tokens, all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever) but as long as the 5 forward passes of the draft model + 1 prefill of the target model is faster than 4 forward passes of the target, you will have a speedup while maintaining the exact output distribution as the target.

Same reason why prompt processing is faster than text generation.

When you already know the tokens ahead of time you can calculate the probabilities of all tokens batched together, incurring significant bandwidth savings. This won't work if you're already compute bound so people with macs/etc. won't get as much benefits from this.

You would only use the base model during training. This is a distillation technique

Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.

presumably that happens at training time?

then once successfully trained you get faster inference from just the diffusion model

I think your excitement is justified. The paper is claiming a serious bridge between AR quality and parallel decoding, and the lossless LoRA-assisted mode is the wildest part.

Block auto regressive generation can give you big speedups.

Here is what they do mechanically, given a sequence p consisting of five tokens PE([p1, p2, p3, p4, p5]). Where PE(.) adds relative position info to each token.

2. Then separately[2], train it as usual on PE([p1 p2 p3 p4 p5]).

Step (2) teaches it to be a usual LLM and predict the next token. No MASK tokens involved.

So far, you have trained a multi-token predictor.

Now during inference

here is an example,

[1] https://news.ycombinator.com/item?id=45221692

[2] Note that it is computationally a single forward pass. Attention masks help you fuse steps 1 and 2 into a single operation. However, you still have 2 separate loss values.

It's clearly a typo on the year, April 12 was two days ago, a quick check in HuggingFace shows that they were uploaded 5 days ago.

This translates to

> I understand it improved by 3x, but has the bottleneck shifted from Memory Bandwidth to Compute? Or is Memory Bandwidth still dominant?

But why did you post your comment in Japanese? We have so many good options for automated translation nowadays!

でも、なぜ日本語でコメントを投稿したんですか？最近は自動翻訳の良い選択肢がたくさんあるのに！

Native Japanese speaker here.

The original Japanese comment is clearly machine translated from another language to English. @Openpic is trolling.

I'd just downvote.

Well, you can take the output of a first pass and pass it back through the model like AR “reasoning” models do at inference time.

Yes and has this been tried?

I'm not in on the joke, can someone ELI5

Perhaps there is none.

I'm not a native English speaker and every now and then I see a comment in my mother tongue (downvoted to all hell of course). It's usually some kind of offhand remark.

Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.

You would only use the base model during training. This is a distillation technique

Same reason why prompt processing is faster than text generation.

presumably that happens at training time?

then once successfully trained you get faster inference from just the diffusion model

Native Japanese speaker here.

The original Japanese comment is clearly machine translated from another language to English. @Openpic is trolling.

I'd just downvote.

Perhaps there is none.

I'm not a native English speaker and every now and then I see a comment in my mother tongue (downvoted to all hell of course). It's usually some kind of offhand remark.

Yes and has this been tried?

Yes, Mercury 2 is a reasoning model [0].

[0] https://docs.inceptionlabs.ai/get-started/models#mercury-2

Yes, Mercury 2 is a reasoning model [0].

[0] https://docs.inceptionlabs.ai/get-started/models#mercury-2

Introspective Diffusion
Language Models

69.6

AIME-24 (I-DLM-8B)
vs. LLaDA-2.1-mini 43.3

45.7

LCB-v6 (I-DLM-8B)
vs. LLaDA-2.1-mini 30.4

2.9-4.1x

Throughput over
LLaDA-2.1-mini at C=64

Lossless

Bit-for-bit identical
to base AR model

Abstract

Diffusion language models (DLMs) offer a compelling promise: parallel token generation could break the sequential bottleneck of autoregressive (AR) decoding. Yet in practice, DLMs consistently lag behind AR models in quality.

We argue that this gap stems from a fundamental failure of introspective consistency: AR models agree with what they generate, whereas DLMs often do not. We introduce the Introspective Diffusion Language Model (I-DLM), which uses introspective strided decoding (ISD) to verify previously generated tokens while advancing new ones in the same forward pass.

Empirically, I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart, outperforming LLaDA-2.1-mini (16B) by +26 on AIME-24 and +15 on LiveCodeBench-v6 with half the parameters, while delivering 2.9-4.1x throughput at high concurrency. With gated LoRA, ISD enables bit-for-bit lossless acceleration.

Why Introspective Consistency?

Key Insight: AR training unifies generation and introspection in one forward pass. Existing DLMs miss this — they learn to denoise but not to introspect.

We identify three fundamental bottlenecks in current DLMs:

(1) Low introspective consistency. SDAR: 0.699 vs. I-DLM: 0.984.

(2) Compute inefficiency. TiDAR: ~7.8x overhead vs. I-DLM: ~2.5x.

(3) Infrastructure mismatch. SDAR slope=84 vs. I-DLM: 549.

The I-DLM Method

Introspective-Consistency Training

Convert pretrained AR models via causal attention, logit shift, and an all-masked objective.

Introspective Strided Decoding

Generate N tokens per forward pass while verifying prior tokens via the p/q acceptance criterion.

AR-Compatible Serving

Strict causal attention enables direct integration into SGLang with no custom infrastructure.

Decoding paradigm comparison. I-DLM is a drop-in replacement within AR serving infrastructure.

Results

I-DLM is the first DLM to match same-scale AR quality while surpassing all prior DLMs across 15 benchmarks.

End-to-End Quality

Blue = best non-AR <30B. Bold = best non-AR <100B.

32B | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | Knowledge & Reasoning | | ARC-C | 95.8 | 97.2 | 90.2 | --- | --- | 91.9 | 93.2 | --- | --- | 95.8 | 96.8 | | MMLU | 83.5 | 87.2 | 74.5 | --- | --- | 78.6 | 82.8 | --- | --- | 82.4 | 86.8 | | MMLU-Pro | 75.1 | 80.1 | 64.8 | 74.8 | 76.6 | 56.9 | 61.5 | --- | --- | 73.1 | 79.7 | | GPQA-D | 58.9 | 64.1 | 46.0 | --- | --- | 40.2 | 36.7 | --- | --- | 55.6 | 62.1 | | GPQA | 55.4 | 65.0 | 53.3 | 62.3 | 67.3 | --- | --- | --- | --- | 54.9 | 58.7 | | Math | | GSM8K | 96.0 | 94.7 | 89.0 | --- | --- | 91.7 | 91.4 | --- | --- | 95.0 | 94.9 | | MATH-500 | 95.8 | 97.8 | 85.0 | --- | --- | 78.6 | 77.8 | --- | --- | 96.8 | 97.6 | | MathBench | 93.1 | 95.5 | 84.2 | --- | --- | 76.9 | 79.3 | --- | --- | 89.1 | 95.6 | | AIME-24 | 73.1 | 76.7 | 43.3 | --- | --- | 10.0 | 16.7 | --- | --- | 69.6 | 83.3 | | AIME-25 | 65.4 | 80.0 | 43.3 | 60.0 | 63.3 | 10.0 | 10.8 | --- | --- | 60.8 | 80.0 | | Code | | HumanEval | 95.1 | 96.3 | 86.0 | --- | --- | 78.7 | 87.2 | 90.0 | 89.6 | 93.3 | 96.3 | | MBPP | 93.4 | 95.7 | 82.1 | --- | --- | 72.0 | 71.6 | 76.6 | 76.0 | 92.2 | 94.6 | | LCB-v6 | 50.3 | 58.3 | 30.4 | 42.5 | 45.4 | 16.6 | 21.7 | --- | --- | 45.7 | 57.1 | | Instruction Following | | IFEval | 84.7 | 84.5 | 83.2 | 82.6 | 83.6 | 61.4 | 60.6 | --- | --- | 84.7 | 84.7 |

Throughput

Throughput-latency tradeoff compared with DLMs across batch sizes (1, 4, 16, 64). I-DLM delivers 2.9-4.1x higher throughput than LLaDA-2.1-mini and SDAR at C=64.

Speedup Factor Explorer

In the memory-bound decode regime, TPF closely approximates wall-clock speedup: a TPF of 2.5 represents roughly 2.5x faster decoding than AR. Explore how acceptance rate and stride size affect this below.

I-DLM acceptance rate (p):0.90

R-ISD LoRA overhead (α):1.12

Gated LoRA adds compute at MASK positions for bit-for-bit lossless output. α=1.12 matches empirical overhead.

N=2

N=3

N=4

N=8

Memory-bound: Speedup ≈ TPF = (2+p+...+pN-2) / (2-pN-1)
R-ISD (lossless): Speedup ≈ TPF / α — gated LoRA guarantees bit-for-bit AR output.

How do DLMs perform as they approach compute-bound?

At high concurrency, forward pass latency scales with query count per forward. We can measure compute efficiency as TPF²/query_size — how much useful output each FLOP produces relative to AR (efficiency = 1):

SDAR (N=4, p=0.5): TPF ≈ 1.1, processes N=4 queries/forward → compute efficiency = 1.1²/4 ≈ 0.31. Each FLOP produces only 31% as much output as AR. This pushes SDAR into compute-bound early, and its throughput plateaus (batching efficiency slope = 84, see motivation figure).
I-DLM (N=4, p=0.9): TPF ≈ 2.9, processes 2N−1=7 queries/forward → compute efficiency = 2.9²/7 ≈ 1.22. Each FLOP produces more useful output than AR — I-DLM stays in the memory-bound regime at concurrency levels where SDAR is already saturated (batching efficiency slope = 549).

Efficiency > 1 means parallel decoding actually saves total compute vs. AR. This is why I-DLM's throughput scales with concurrency while SDAR and LLaDA plateau in the throughput figure above.

Per-Position Acceptance Breakdown

Acceptance compounds geometrically: position k has probability $p^{k-1}$. Position 1 is always accepted (logit shift).

Documentation & Resources

Everything you need to train, serve, and deploy I-DLM. Click any card to expand.

Installation

git clone https://github.com/Introspective-Diffusion/I-DLM.git cd I-DLM/inference bash install.sh

See inference/README.md for detailed environment setup.

Quick Start

1. Launch server:

python -m sglang.launch_server \ --model-path yifanyu/I-DLM-8B \ --trust-remote-code --tp-size 1 --dtype bfloat16 \ --mem-fraction-static 0.85 --max-running-requests 32 \ --attention-backend flashinfer --dllm-algorithm IDLMBlockN \ --dllm-algorithm-config inference/configs/idlm_blockN4_config.yaml \ --port 30000

2. Generate:

curl http://localhost:30000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "default", "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}], "max_tokens": 4096, "temperature": 1.0 }'

Training

Convert a pretrained AR model into I-DLM via introspective-consistency training:

Input: Concatenate fully-masked sequence with clean sequence [x_t | x_0]
Attention: Strict causal masking across all positions
Loss: Auto-balanced CE on both masked and clean positions
Data: 4.5B tokens, 8 H100 GPUs, 2 epochs with stride curriculum (N=2 then N=3)

See training/README.md for scripts and configs.

Inference & ISD

Introspective Strided Decoding (ISD) generates and verifies in a single forward pass:

MASK positions: Propose new tokens (distribution q)
Clean positions: Verify prior tokens (anchor distribution p)
Acceptance: min(1, p(x)/q(x)) guarantees AR-distribution output
Stride N=4: TPF=2.96, ~3x wall-clock speedup in memory-bound regime

See inference/README.md for algorithm configs.

Serving (SGLang)

I-DLM uses strict causal attention, enabling direct integration into SGLang with no custom infrastructure:

Paged KV cache and continuous batching
CUDA graph capture (+42-76% throughput)
Stationary-batch decode-loop scheduling (+11-21%)
Argmax proposals (+11-15%)
Paged-only attention kernel (+10-14%)

Full system achieves 2.1-2.5x throughput over naive baseline.

Lossless R-ISD

Residual ISD (R-ISD) adds a gated LoRA adapter for bit-for-bit lossless acceleration:

LoRA active only at MASK positions; verify positions use base-only weights
Output is identical to the base AR model by construction
LoRA rank=128, overhead factor ~1.12x
Model: yifanyu/I-DLM-8B-lora-r128

Model Zoo

Model	Base	Description
I-DLM-8B	Qwen3-8B	Main model, matches AR quality
I-DLM-32B	Qwen3-32B	Large scale, outperforms LLaDA-2.1-flash (100B)
I-DLM-8B-LoRA	Qwen3-8B	Gated LoRA (rank=128) for lossless R-ISD

All models use trust_remote_code=True (custom SDARForCausalLM architecture).

Benchmarks

We evaluate on 15 benchmarks across 4 categories with thinking mode enabled:

Knowledge: ARC-C, MMLU, MMLU-Pro, GPQA-D, GPQA
Math: GSM8K, MATH-500, MathBench, AIME-24, AIME-25
Code: HumanEval, MBPP, LiveCodeBench-v6
Instruction: IFEval

See inference/eval/ for reproduction scripts.

Citation

@article{yu2026introspective, title={Introspective Diffusion Language Models}, author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon and Dao, Tri and Athiwaratkun, Ben and Zou, James and Lai, Fan and Xu, Chenfeng}, journal={arXiv preprint arXiv:7471639}, year={2026} }

Hacker Times

Hacker Times

Introspective Diffusion Language Models

Discussion

Discussion

Introspective Diffusion
Language Models

Abstract

Why Introspective Consistency?

The I-DLM Method

Introspective-Consistency Training

Introspective Strided Decoding

AR-Compatible Serving

Results

End-to-End Quality

Throughput

Speedup Factor Explorer

Per-Position Acceptance Breakdown

Documentation & Resources

Installation

Quick Start

Training

Inference & ISD

Serving (SGLang)

Lossless R-ISD

Model Zoo

Benchmarks

Citation

Hacker Times

Hacker Times

Introspective Diffusion Language Models

Discussion

Discussion

Introspective DiffusionLanguage Models

Abstract

Why Introspective Consistency?

The I-DLM Method

Introspective-Consistency Training

Introspective Strided Decoding

AR-Compatible Serving

Results

End-to-End Quality

Throughput

Speedup Factor Explorer

Per-Position Acceptance Breakdown

Documentation & Resources

Installation

Quick Start

Training

Inference & ISD

Serving (SGLang)

Lossless R-ISD

Model Zoo

Benchmarks

Citation

Introspective Diffusion
Language Models