Eagle 3.1: Collaboration Between the EAGLE Team, vLLM Team, and TorchSpec Team

> performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.

I heard that speculative decoding doesn't affect performance (I meant accuracy). Am I wrong about it?

Are these speculative decoders ok to use for AI coding agents or do they only fit certain workloads?

> The EAGLE team traced this fragility to a phenomenon we call ‘attention drift’

Ok that’s downright fascinating. I am one of the world’s foremost experts on the AI psychosis sufferers posting grand theories on Reddit, and ‘drift’ is one of the words that chatbots come back to again and again when told to ponder their own Being (so much so that it even shows up in clearly-unrelated/incorrect contexts — pretty sure I’ve seen both ‘quantum drift’ and ‘spiritual drift’).

It’s probably the #3 most common, after ‘recursion’ and ‘coherence’; I bet ‘coherence drift’ has popped up a thousand times by now, but ‘attention drift’, ‘token drift’, ‘spiritual drift’, ‘cognitive drift’, and ‘semantic drift’ have all gotten airtime AFAIR.

Obviously the primary thing going on there is vulnerable laypeople convincing themselves that they’ve cracked some major part of science, but I do honestly wonder about the unintentional throughlines… This might be the first time I’ve noticed one of them show up in a real paper, though.

Is there some intuitive wisdom in how LLMs tend to approach themselves, perhaps? Or are those terms inevitable when talking via and/or about a 1:1 turn-taking conversation?

I saw EAGLE and thought it's going to be about PCB design. Was left disappointed.

They work better for coding workloads. Essentially, the more regular the output, the more the faster model gets right, the less the big model has to do.

Writing tends to have more false positives. I haven't tried this particular one, however, but that is the general trend.

Speculative decoding shouldn't actually change the accuracy of the response. The draft model drafts a couple tokens, and the inference framework verifies that the larger model would have picked them.

However, I've found that speculative decoders don't help much if you're running a model locally on limited hardware (for instance, my 32GB VRAM M1 Max from 2021). For one, you have to fit both the large and the small drafter model in memory. For another, if you're running a quantized model, the activation distribution is different enough that the draft model has a hard time guessing what's coming next.

My take is that speculative decoding is most useful on _very expensive_ prosumer/hobbyist setups where you have 128GB of VRAM and are running your local models with full fidelity. It's also helpful for inference providers where they can send output tokens at a computational cost slightly higher than their input token cost.

I think so, the benchmark is on a coding dataset (SPEED-Bench).

Well, there are only so many nouns, and even fewer "cool-sounding" ones. For better project differentiation, do you think we should instead be naming things "ZurgGlurg327"? I'm sure you can find a completely-unique combo for each thing, but good luck remembering the name!

autorouters are the closest thing to "just spin some cycles to do this for me"

Same here. KiCad is great, but I miss EAGLE.

It feels like AI folks are particularly careless about checking if a name has been used in computing before.

You're not wrong about that. Speculative decoding does not affect the quality of tokens generated, as each token has to be verified by the parent model before it is output.

Each token generated by the draft model has to be verified by the parent/original model. If the share of draft tokens that the parent model accepts falls, the speedup from speculative decoding is eliminated. This acceptance rate, and more directly the speedup from draft models, is what "performance" refers to in the article.
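To make the "verified by the parent model" point concrete, here is a minimal sketch of the greedy variant of speculative decoding. Everything in it is a toy assumption for illustration: `toy_target`, `flaky_draft`, and `always_wrong_draft` are hypothetical stand-ins for real models, and a real engine would verify all drafted positions in one batched forward pass rather than one call per token. It is not how any particular framework implements it.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def plain_greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: decode with the target model only."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], float]:
    """Greedy speculative decoding. Returns (tokens, acceptance rate)."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively.
        ctx = list(out)
        drafts = []
        for _ in range(k):
            tok = draft(ctx)
            drafts.append(tok)
            ctx.append(tok)
        # 2. The target checks each drafted token in order.
        for tok in drafts:
            proposed += 1
            expected = target(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                # First mismatch: keep the target's token, discard the rest.
                out.append(expected)
                break
        else:
            # Every draft was accepted: the target contributes one extra token.
            out.append(target(out))
    out = out[:len(prompt) + max_new]
    rate = accepted / proposed if proposed else 0.0
    return out, rate


# Toy "models" (assumptions for the demo, not real networks).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 7


def flaky_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except at every third context length.
    return toy_target(ctx) if len(ctx) % 3 else (toy_target(ctx) + 1) % 7


def always_wrong_draft(ctx: List[Token]) -> Token:
    return (toy_target(ctx) + 1) % 7


if __name__ == "__main__":
    baseline = plain_greedy(toy_target, [1], 20)
    for name, d in [("flaky", flaky_draft), ("always wrong", always_wrong_draft)]:
        tokens, rate = speculative_greedy(toy_target, d, [1], 20)
        print(name, "same output:", tokens == baseline, "acceptance rate:", round(rate, 2))
```

In this sketch the output matches the target-only baseline no matter how bad the draft is; only the acceptance rate changes, and with it the number of target steps saved. Sampling-based decoding uses a probabilistic accept/reject rule instead of the exact-match check shown here.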

When one is talking about two things gradually diverging, isn't "drift" a natural, descriptive verb to reach for? I've heard it often when discussing an implementation diverging from a specification, for example.

Your experience might be a bit dated, depending on when you last tried it. MTP (which is a flavor of spec decoding) is showing really solid improvements on local models, even on consumer hardware.

In fact, as the article mentions, you get the biggest gains at low concurrency (so local should apply), with diminishing returns for higher concurrency (if you think in terms of unit of compute, it's probably better to serve more requests in parallel and get more throughput that way).

Eagle3 was great at low context tho, and this seems to improve things at high context. That's really cool, and hopefully it'll turn out to be useful at those lengths. Eagle3 is also training dependent, so you could try training your own if your use cases diverge enough that 3rd party "generalist" models don't suit your needs (in general nvda, redhat, etc. have provided general eagle3 models for popular families).

So the draft model's performance is directly linked to the overall speed. Thank you for the explanation!

By the way, can it be slower than without speculative decoding in worst case then?

docker images and ubuntu releases use an adjective, this could at least allow some alternatives like bold/supreme/decisive/depressed eagle (or just use battery staple)

> can it be slower than without speculative decoding in worst case then?

Yes - running the draft model costs compute and memory bandwidth, and running the drafted futures through the main model costs compute. If the draft model were really inaccurate or you're already compute-limited (usually: running large batches) you would expect some slowdown.

In practice, for single-user (non-batched) inference with a working configuration, you pretty much always get some speedup. For non-coding tasks I've seen it be nearly a wash for some people, in which case you might want to avoid it due to the extra memory usage (you'd rather use that memory to run a bigger quant/model, even at a slightly lower speed).
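A rough back-of-the-envelope model of that tradeoff, as a sketch only. It assumes each drafted token is accepted independently with probability `alpha` (a simplification), and the cost ratios `draft_cost` and `verify_cost` are hypothetical knobs, not measurements from any real setup.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target always contributes one token
    (the correction, or a bonus token when all drafts are accepted).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (one plain target step costs 1.0).

    draft_cost:  cost of one draft step relative to one target step.
    verify_cost: cost of verifying the drafted positions relative to one
                 target step; roughly 1.0 when memory-bandwidth-bound,
                 higher when already compute-bound (e.g. large batches).
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + verify_cost)


if __name__ == "__main__":
    # Hypothetical settings chosen only to show the trend.
    for alpha in (0.9, 0.8, 0.5, 0.1):
        light = estimated_speedup(alpha, k=4, draft_cost=0.05)
        heavy = estimated_speedup(alpha, k=4, draft_cost=0.05, verify_cost=2.0)
        print(f"alpha={alpha}: low concurrency x{light:.2f}, compute-bound x{heavy:.2f}")
```

Under these assumptions the estimate drops below 1.0 (a net slowdown) when the acceptance rate is low or when verification is no longer nearly free, which lines up with the large-batch caveat above. Memory pressure from holding the draft model is not modeled here.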

Hacker Times

Discussion