Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

This looks very interesting. Possible to get those rates without exotic hardware.

But I have to say that the comparison is not really fair. It pits a 2B model against frontier models that are likely hundreds of times larger. Also, Taalas, with their 15,000 tok/s inference, is suspiciously missing from the comparison.

We need to see the comparison with this framework and useful models, which at present seems to mean ~30 B.

Follow-up reading for the most technical and research-minded people here:

Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...

Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...

To try the speed on the playground: http://playground.kog.ai

For me it's 3.4k tok/s of pure nonsense. The model is bad: you tell it it's wrong, it acknowledges it's wrong and repeats the same nonsense. It reminds me of my nephew though. Ask it something like: "I want to play the guitar on the surface of the Moon. What speakers do you suggest." and then "But Moon has no atmosphere, how the sound will travel?".

This is very cool.

I have been lamenting for a while that the memory-bandwidth <-> tps relationship was pretty much working for small models on consumer cards, but not at all on datacenter hardware.

It's great to see that with proper care on the inference engine implementation the relationship can be restored.

Do you think the work will still apply to speculative/alternative decoding methods like MTP and block diffusion, which are making batch=1 decoding less memory bound? Kernel launch overhead and memory transfer become less and less significant as a % of time when computing multiple tokens at once.

When I read "Standard GPUs" in the title I got excited for a second then I read the article itself..

> Standard GPUs

> 8× NVIDIA H200

Don't miss trying their demo: https://playground.kog.ai/

Feels like a preview of the future

H200 isn't a standard GPU at all

Congrats gaeld and team

The demo is very impressive!

disclaimer: I've known the founder for a while, as legitimate as it gets in deep tech, real years of research and engineering behind this, not vaporware

Huh, interesting. Some parts of this do generalize even to an RTX 6000 Pro Blackwell, I imagine, though we're going to be solidly bottlenecked then on inter-card throughput through the PCIe interface.

Looks super promising! A couple of questions:

For new open weights models, will you need to adapt model code and optimization for your inference engine by hand?

It's true that BS=1 is king when it comes to agentic workflows; however, these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well?

Any plans to release it open source?

Congratz again for the release

Could be amazing, but it's hard to judge if it will really work with say a 27 B model or larger. We can already get pretty good speed with a 2B model.

An article with a title saying tokens per second throughput without any qualifier e.g. what size the model is should immediately be classified as spam.

>This preview runs a 2B model

I guess with 1B or 500M model inference would be even faster?

Making these claims on a 2B parameter model seems a bit like seeing linear scalability from 1 to 4 cores and then assuming 256 cores will give you a 256x speedup. Or demonstrating massive improvement on datasets that fit in cache and then assuming the same improvements will be present on problem sizes that span the memory of multiple machines. Something tells me that scaling to larger models will be more difficult than assumed.

I have a naive question here. First, the token speed is very impressive, but why is this the highlight? I would prefer to see the actual model performance.

I had to test it myself to believe this unreal inference speed.

each time getting 3300+ tps.

I can think of real-time video, shader generation, and real-time worldbuilding as the type of problems that could require such high token throughput.

For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.

Is this the new gateway to a "Model On a Chip"? Is it possible to etch the weights on silicon and get a very efficient way to use a LLM?

Title is pure bait. Where is Datacenter GPU gone?

NVIDIA H200 Is not a standard GPU. 8 of them in a box with a cpu and ram costs close to the same as a house.

I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, this article is misleading.

I can get a 27b model to kick out 40 tok/s on 16 gb vram. This is the area ripe for development.

If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.

I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds.

That's really nice of them.

That means Jensen can add another 30 times faster when comparing Rubin to Blackwell without having to actually do anything.

Hopefully that means he won't have any problem to make another 150 billion in profit in the next year.

Sorry for the sarcasm. Looks like interesting work.

This looks very interesting. Possible to get those rates without exotic hardware.

We need to see the comparison with this framework and useful models, which at present seems to mean ~30 B.

Great points.

We strove to be as fair as possible in the benchmark, but it's indeed not perfect. Taalas should have been added in the dedicated hardware section, even though they use 3-bit quantization while we are on FP16 (to be fair in both directions) and they burn the model directly on the card.

Our tech preview is about the speed (hence the small dense model, it was easier to implement).

The math checks out, though, to allow support for large frontier MoE models at similar speeds:

- At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).

- DeepSeek V4 Flash has 13B in mixed FP4/FP8, so let's say ballpark around 3x bigger than 4 GB. So in theory we could reach >1,000 tok/s on it with MI300X/H200 and up to 4k on next-generation GPUs.

Check out the math at the end of our blog post:

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...

They got 1K tok/s with Deepseek v4 Pro. That's kinda cool..

Fallacies look interesting ? Like if we aren't getting dubious claims every day ?

likely the small model makes whatever fuzzer they designed to poke the gpus much faster optimizations.

they seem to think it scales up because they're shortening the stack.

Great points.

Our tech preview is about the speed (hence the small dense model, it was easier to implement).

Check out the math at the end of our blog post:

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...

Your playground/write-up is very interesting and I would be really interested when you can have something like Deepseek V4 Flash model (49B) running as you are suggesting.

I haven't read the article yet (I hope to get to it), but I want to ask: can this approach work for, say, trillion-parameter or other very large models as well, or is there some wall that makes it valuable only for smaller models?

That being said, it's still really incredible: these small models are getting really good for many use cases, and speed is becoming their bottleneck. With greater speeds on consumer hardware, I think it's gonna be amazing work!

Could be amazing, but it's hard to judge if it will really work with say a 27 B model or larger. We can already get pretty good speed with a 2B model.

thanks! we explain how it scales to larger models in the last section of the OP blog post

Why not, it's one way to look at it! Although I have yet to see other work with speculative decoding go higher than ~1,000 tokens/s, because the other bottlenecks start to matter at that point, and they need to be solved to go further.

Our view is that MTP / speculative decoding could help us get an X multiplier (X = 2 to 6) on the tokens-per-second speed we currently achieve.

We are a bit greedy, we want to stack optimizations on top of each other to get the maximum speed possible.

It involves additional compute to verify the predicted tokens during the forward pass (it's like a small batch), which should be totally doable for dense models, and will be more tricky for MoEs because it could mean activating more experts and thus more active parameters.

It looks like DTP is a distinct architectural choice that would require training new models accordingly? This wouldn't be able to just run inference for existing models.

As in not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?

Everyone beholden to a data center or subject to the installation on the corner of your property of course. Keep up with the times... /s

Thanks a lot! Much appreciated.

To answer your questions:

- yes, we rewrite the whole model code (while keeping the same logic) in CUDA/HIP and assembly, in order to optimize by hand for each GPU type. It's quite tedious for sure, but I guess this is the price to pay to get this kind of results.

- the batching question is a great one. In agentic systems, there is probably a trade-off between sequential thinking/iterations vs parallel exploration of multiple solutions. Also, there could just be multiple independent tasks running in parallel, depending on the use case.

We plan to support a small amount of batching, but it quickly becomes a trade-off vs speed. Pick one for your use case, I guess.

Also to consider: because we answer requests much faster, we are also able to process lots of them without needing high batches - and scaling on multiple nodes is possible.

- open sourcing: maybe, maybe not. I'm still undecided on this. We are a small startup and I'm told that giving our IP away might be shooting ourselves in the foot. On the other hand, I think it could be of great benefit to the community and for us... we'll see

Don't miss trying their demo: https://playground.kog.ai/

Feels like a preview of the future

You can get something pretty fast right now with a Cerebras Coder subscription, sadly I think the best model they had last I checked was the somewhat dated GLM 4.7: https://inference-docs.cerebras.ai/models/overview

I feel like if they got DeepSeek V4 Flash and Pro running on their hardware, even if at less than 1000 tok/s, they’d still be crushing it with any subscription they’d provide, given how generous their token limits were.

As for the demo, it's fast and extremely dumb, as expected for 2B. I asked how to stop a drinking habit and in just one follow-up message it recommended trying 8% ABV. Hilarious.

https://chatjimmy.ai/ from Taalas also feels like that.

I have a naive question here. First, the token speed is very impressive, but why is this the highlight? I would prefer to see the actual model performance.

Token generation speed matters for sequential agentic workflows, like software engineering / vibe coding, where a lot of reasoning tokens, code generation, refactoring, testing, etc. happen in a loop before an actual outcome is served to the user.

About model performance, we plan to support the latest frontier models (this tech preview is about the speed of the engine)

I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds.

Fair point - this tech preview is about the speed (hence the small dense model, it was easier to implement).

The math checks out though to allow support for large frontier MoE models at similar speeds.

At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).

DeepSeek V4 Flash has 13B in mixed FP4/FP8.

Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...

I can think of real-time video, shader generation, and real-time worldbuilding as the type of problems that could require such high token throughput.

For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.

That sounds a little bit like the 64kb memory is enough, then someone invented electron ;P

But joke aside, I think we don't even know yet what is possible if you hit very fast very high token / second numbers if your whole ecosystem behind it can handle it.

You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.

You could build and architect a whole stack in parallel.

You could do massive thinking token / chain of thought.

You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.

We could start doing some type of monte-carlo search with this.

H200 isn't a standard GPU at all

I think they accidentally left out “standard data-center GPUs” from the title. That probably needs fixing. My “standard” GPU is still a 3090

When I read "Standard GPUs" in the title I got excited for a second then I read the article itself..

Yeah, it should have been "Datacenter GPUs" or "Nvidia and AMD GPUs".

what did you have in mind when you read "Standard GPUs"?

NVIDIA H200 Is not a standard GPU. 8 of them in a box with a cpu and ram costs close to the same as a house.

I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, this article is misleading.

I can get a 27b model to kick out 40 tok/s on 16 gb vram. This is the area ripe for development.

If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.

I guessed you thought about consumer GPUs. We are about standard datacenter GPUs indeed.

Sorry for the confusion

Yeah, I agree: I'm actually not expecting it to be easy, and there will certainly be several unknown unknowns we'll discover along the way.

Our process has been, and will continue to be, a sequence of (tedious) R&D experiments where the GPU never behaves as expected when pushed to its limits in ways no-one really tested before (I still have nightmares of the L3 cache cross-IOD bottlenecks on MI300X).

IMHO, we did solve the multi-GPU memory bandwidth scaling problem, and thus the linear scaling of the size of the model towards infinity. But the main difficulties will come from keeping the speed, with steady and continuous memory streaming, while implementing the much more complex architecture of modern frontier MoEs (attention compression tricks, hash layers, routing logic, etc.)

Your playground/write-up is very interesting and I would be really interested when you can have something like Deepseek V4 Flash model (49B) running as you are suggesting.

Consumer inference scenarios tend to be highly bespoke so it's difficult to apply a monokernel approach based on deep manual optimization. I suppose this could become applicable to rare scenarios where both the model and the hardware are fixed and self-contained, e.g. I'm running Apple's AI model on the latest Apple Silicon hardware. Then this becomes a viable approach even for 'consumer' use.

The authors' approach also encompasses multi-node approaches that won't apply easily to consumer inference since consumer GPUs have very low-performance interconnects, hence why layer parallelism is usually favored. (But that doesn't work very well with the monokernel approach, since it involves running distinct logic on each separate GPU. It also doesn't speed up single inference, though you can get that throughput back by pipelining small minibatches.)

Thanks for the comment and the question!

The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.

Also worth noting that our results are currently for standard datacenter GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.

They got 1K tok/s with Deepseek v4 Pro. That's kinda cool..

Thanks. To be fair, this number is what we expect to get once we port DeepSeek V4 in our engine on the upcoming generation of GPUs!

thanks! we explain how it scales to larger models in the last section of the OP blog post

Shame you stopped short of actually benchmarking that scale though, eh?

It looks like DTP is a distinct architectural choice that would require training new models accordingly? This wouldn't be able to just run inference for existing models.

Totally, though DTP is not required for this kind of speed. Standard TP also works.

DTP is something we built for our roadmap in order to get to extremely high speeds (like 10k+ tokens/s). When the budget is under 10 µs per layer, any little overhead matters.

For 1k to 5k tokens/s, regular TP still works because we are able to optimize the inter-GPU all-reduce collectives to under 3 µs, which lets us continue streaming model weights into shared memory, registers and caches while GPUs exchange data.

it's also a coding model

what did you have in mind when you read "Standard GPUs"?

The GPU in my desktop. (A normal-ish decent gaming machine that runs LLMs and txt2img well enough.)

In contrast, not enterprise GPUs that cost as much as a car.

I guessed you thought about consumer GPUs. We are about standard datacenter GPUs indeed.

You know, Radeon 9800 pro ago

As in not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?

I think many would assume "not enterprise" or "not datacenter grade" when someone says "Standard GPUs", but maybe that specific phrase has a specific meaning I'm not familiar with.

Edit: I just tried a 4B model on a RTX Pro 6000, getting ~500 tok/s with llama.cpp not even trying to optimize or change anything, just default settings. I'm sure with vLLM it'd be a lot faster already, still before manually tuning configs. I wouldn't call that card "Standard GPU" either FWIW, but it makes the claimed performance numbers feel not as exciting, especially given the hardware they were using.

> Did you expect a single GPU chip to reach 3k tps?

Did the article headline not say Standard GPU?

so what would be the above-standard GPUs then that they are excluding? Cerebras is not GPU

scenarios where both the model and the hardware are fixed and self-contained

That's basically antirez's DS4 and it works pretty well because there are few leading models and few hardware platforms (Apple, GB10, Strix Halo) that are worth using.

Do you think maybe changing your article's title from "Real-time LLM Inference on Standard GPUs" to "Real-time LLM Inference on Standard Datacenter GPUs" might make sense here? More people seem confused by the title than not, and you could clear this up relatively easily, at least on your website, although it might be too late to fix the HN title.

Oh, it isn't confusing, it is misleading. A standard GPU lets you connect a monitor. A datacenter GPU lets you do headless math.

Shame you stopped short of actually benchmarking that scale though, eh?

will do - we are a small team and it takes time to implement and optimize a new model, whatever the size.

it's also a coding model

Nah. it says it can't even write python code

I guessed you thought about consumer GPUs. We are about standard datacenter GPUs indeed.

How would you classify a datacenter GPU as standard/non-standard? That doesn't seem to be a meaningful distinction. It's click bait.

What a lot of us on here are salivating for is the ability to run these on prosumer hardware at home. So we tend to jump to the conclusion that "standard" means "consumer-grade" because that's what we want to see. Still, very cool work!

I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.

Oh, it isn't confusing, it is misleading. A standard GPU lets you connect a monitor. A datacenter GPU lets you do headless math.

I updated the article title accordingly

YES - I just updated the title of our article according to your suggestion.

Nah. it says it can't even write python code

I tried with some simple prompts (fibonacci, linked list manipulation) and it worked nicely.

How would you classify a datacenter GPU as standard/non-standard? That doesn't seem to be a meaningful distinction. It's click bait.

The blog makes it clear that "standard" GPU here is in opposition to purpose-built hardware like Cerebras. The selling point is reaching the same order of magnitude in generative speed as those approaches.

thank you deflator, I understand this now! much appreciated

I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.

Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement), our article explains how we expect to support large frontier MoE at 1,000 to 5,000 tokens/s

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).

All our work at Kog is about removing these bottlenecks.

Standard != Datacentre

Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement), our article explains how we expect to support large frontier MoE at 1,000 to 5,000 tokens/s

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

All our work at Kog is about removing these bottlenecks.

Thank you for explaining. Do you think there are still opportunities for stack optimizations to meaningfully speed up inference on single consumer-grade GPUs?

That doesn't clarify anything lol. It's a bit click baity.

Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). This preview runs a 2B model, with support for large third-party MoE models coming next at similar speeds.

Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)

(see below for full benchmark details)

TL;DR: we show that AI inference on GPUs can be super-fast, reaching the speed regime of dedicated inference hardware cards when optimizing the whole software stack with architecture/engine/kernel co-design. Test the speed in our live coding playground: playground.kog.ai.

This post explains why optimizing for single-request LLM decoding speed is important for AI agents; why it's primarily a memory-bandwidth maximization problem, not a FLOPS one; why standard datacenter GPU hardware has a much higher decoding-speed ceiling than current inference stacks expose due to software bottlenecks; and how that ceiling can be reached (even on large MoE models) by co-designing the model architecture, runtime, and low-level GPU code as a single latency-optimized pipeline.

Our public tech preview is about proving that extremely fast single-request decoding is possible on the standard datacenter GPUs enterprises already own — including AI labs and sovereign-AI buyers. The limiting factor has been that existing inference software stacks are not optimized for this type of workload. Opening the GPU path could deliver that speed without the lock-in of proprietary silicon.

You can test the speed of our 2B coding model today. It's small and not a frontier model (we've been focused on speed rather than scale), though still quite capable when fine-tuned for specific software engineering tasks.

What autonomous agents change: single-request decode speed is now the metric that matters

Inference benchmarks typically conflate three quantities. Aggregate throughput (total tokens generated per second across all users) measures server utilization and rewards large batches. Time to first token measures prefill latency. Decode speed per request measures token generation speed and defines how long one user waits before receiving the full response. That last one governs every long serial interaction, and it's what AI agents are bottlenecked on.

Agentic software engineering is a sequential loop: inspect, plan, edit, test, revise. Each step depends on the previous one. Tool time sometimes dominates, as tests have to run and web pages have to load, but the generation-heavy steps (planning, code writing, trace analysis, debugging, refactoring) set the loop rate. And reasoning tokens compound on top.

The numbers translate directly into product and user experience. If an agent needs to generate 50,000 tokens in a workflow, 100 tokens/s is roughly eight minutes; 3,000 tokens/s is under twenty seconds. The difference changes the product that can be built.
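The arithmetic behind these figures can be checked with a minimal sketch. The function name is illustrative, and it assumes a constant decode speed with no tool time or prefill:

```python
def generation_time_s(tokens: int, tokens_per_s: float) -> float:
    """Wall-clock seconds to generate `tokens` at a constant decode speed."""
    return tokens / tokens_per_s


WORKFLOW_TOKENS = 50_000  # workflow size used in the text

for speed in (100, 3_000):
    seconds = generation_time_s(WORKFLOW_TOKENS, speed)
    print(f"{speed:>5} tokens/s: {seconds:6.1f} s ({seconds / 60:.1f} min)")
```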

As agents become more autonomous, the productivity frontier shifts from intelligence alone to intelligence × iteration speed. The best agents will generate more useful tokens, reason more, and perform more tool calls, tests, and revisions inside the same wall-clock budget.

This is why Kog optimizes single-request latency first, and why this preview runs at batch size 1. Large batches do matter and we will support them in production, but they answer a different question.

But what is limiting decode speed on GPUs?

Memory bandwidth is the primary bottleneck for fast token generation (and GPU nodes have plenty)

At batch size 1, autoregressive decoding is dominated by matrix-vector work. For each generated token, all the active weights of the model must move through the memory hierarchy inside the GPU, from HBM to compute processors. Thus, a first-order bound is:

tokens/s ≤ effective_memory_bandwidth
           / (β × active_weight_bytes + kv_cache_bytes)

where β can be greater than one when tiles are reloaded or cache reuse is imperfect.

The key fact is that low-batch decode has very low arithmetic intensity. In FP16, a model weight occupies two bytes and contributes roughly one multiply-add (two FLOPs) which is about 1 FLOP/byte. FP8 raises it to ~2 FLOPs/byte; FP4 to ~4. However, modern AI GPUs expose hundreds of peak FLOPs per byte of HBM bandwidth. NVIDIA's H200, for example, claims a peak balance of roughly 400 FLOPs/byte. Thus, token generation speed is capped by memory bandwidth before being limited by FLOPS.
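As a rough illustration of this gap, the sketch below compares the arithmetic intensity of batch-size-1 decode with the ~400 FLOPs/byte balance quoted above. The names are illustrative, and it assumes exactly one multiply-add per weight per token:

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}
FLOPS_PER_WEIGHT = 2  # one multiply-add per weight per generated token
GPU_BALANCE_FLOPS_PER_BYTE = 400.0  # approximate H200 figure quoted in the text


def decode_intensity(precision: str) -> float:
    """FLOPs performed per byte of weights read at batch size 1."""
    return FLOPS_PER_WEIGHT / BYTES_PER_WEIGHT[precision]


for precision in BYTES_PER_WEIGHT:
    intensity = decode_intensity(precision)
    # Fraction of peak compute that can be kept busy when bandwidth-bound.
    usable = intensity / GPU_BALANCE_FLOPS_PER_BYTE
    print(f"{precision}: {intensity:.0f} FLOPs/byte, {usable:.2%} of peak FLOPS usable")
```

Even at FP4, this simple model leaves about 99% of peak compute idle, which is why the bandwidth bound is reached first.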

This is why Memory Bandwidth Utilization (MBU) is the central metric for single-request speed, not Model FLOP Utilization (MFU). MFU can still be improved by batching several requests together, which can however increase the latency experienced by each user as more KV cache data needs to be streamed inside the GPU.

For batch-size-1 decode, more memory bandwidth equals more tokens generated per second. The good news is that memory bandwidth of GPUs is already very high. An 8× NVIDIA H200 node exposes roughly 30.7 TB/s of effective aggregate memory bandwidth (taking 80% of the 4.8 TB/s theoretical per GPU as a realistic ceiling). An 8× AMD MI300X node reaches about 33.6 TB/s in practice (assuming 4.2 TB/s achievable per GPU).

Let's take a 2B-parameter dense model in FP16 as an example. It has roughly 4 GB of active weights, so if weights alone could be streamed perfectly (ignoring KV cache traffic and potential β reloads), the speed-of-light upper bounds would be:

8× H200: 30.7 TB/s ÷ 4 GB ≈ 7,700 tokens/s
8× MI300X: 33.6 TB/s ÷ 4 GB ≈ 8,400 tokens/s

Let's consider a few more examples: at batch size 1, the same speed results apply to a MoE with 4B active parameters in FP8; and a 32B-active-parameter MoE in FP4 would be bounded at ~2,000 tokens/s.
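These bounds follow directly from the first-order formula given earlier. Here is a minimal sketch that reproduces them; the function and parameter names are illustrative, and β = 1 with zero KV-cache traffic is assumed unless overridden:

```python
def decode_ceiling_tokens_per_s(
    bandwidth_tb_s: float,
    active_params_billions: float,
    bytes_per_param: float,
    beta: float = 1.0,
    kv_cache_gb: float = 0.0,
) -> float:
    """First-order bound: bandwidth / (beta * weight bytes + KV-cache bytes)."""
    weight_gb = active_params_billions * bytes_per_param  # 1e9 params * bytes = GB
    gb_per_token = beta * weight_gb + kv_cache_gb
    return bandwidth_tb_s * 1000.0 / gb_per_token  # TB/s converted to GB/s


NODES = {"8x H200": 30.7, "8x MI300X": 33.6}  # effective TB/s, from the text
MODELS = {
    "2B dense, FP16": (2, 2.0),
    "4B active MoE, FP8": (4, 1.0),
    "32B active MoE, FP4": (32, 0.5),
}

for model, (params_b, bytes_per_param) in MODELS.items():
    for node, bandwidth in NODES.items():
        ceiling = decode_ceiling_tokens_per_s(bandwidth, params_b, bytes_per_param)
        print(f"{model:<20} {node:<10} {ceiling:>7,.0f} tokens/s")
```

Passing `beta` above 1 or a non-zero `kv_cache_gb` lowers the ceiling, which is the direction real workloads move in.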

In a latency-first inference stack, a valid strategy is thus to parallelize inference on a full server node providing eight GPUs worth of HBM bandwidth.

It should also be noted that the next GPU generations (Rubin and MI450) coming in H2 2026 will provide about 4x higher memory bandwidth, making it possible to reach the same speed with 4x bigger models, or with 4x fewer GPUs (potentially one or two instead of a full node). This will also help support bigger batch sizes at the same speed. At the end of this post, we dig further into this topic to show that a decoding speed of thousands of tokens per second should be achievable on datacenter GPUs for current large state-of-the-art MoE models.

There is a catch, though. These bounds do not take into account stalls from non-GEMM operations, intra-GPU synchronization, inter-GPU communication, instruction overhead, and so on. The key question is how continuously the system can stream the active model parameters through HBM and cache without interruptions. It turns out that making an 8-GPU server behave like a single continuous memory-streaming machine is a hard problem.

Where standard inference stacks lose precious microseconds

At 3,000 tokens/s, the per-token budget is roughly 333 microseconds, including all layers, LM head and sampling. On a 25-layer model, spending just an extra 1 microsecond per layer consumes 7.5% of the time budget!

The usual abstraction stack — model graph logic written in a high-level language or framework like PyTorch or Triton, lowered into many kernels, scheduled by a CPU runtime, synchronized at kernel boundaries, and mediated by framework-level communication libraries — is flexible, facilitates maintainability and integration, and is great for general-purpose serving, including maximizing aggregated throughput at high batch sizes. This is the approach usually taken for models running on inference engines like vLLM, SGLang, and TensorRT-LLM. It is, however, poorly matched to a 333-microsecond token budget.

A simple launch-overhead calculation shows the problem. If a kernel launch and cleanup costs about 4.5 µs (as per our measurements on AMD MI300X), ten kernels per Transformer layer over twenty-five layers create 1,125 µs of overhead per token before any useful work, thus capping the achievable speed at ~890 tokens/s. Even just five aggressively fused kernels per layer still produce ~563 µs of overhead, capping speed around 1,780 tokens/s. And this is before taking into account the other sources of overhead, which compound on top of this.
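The same calculation can be written as a short sketch. The function names are illustrative, and the model assumes launch costs are fully serialized and ignores every other source of overhead:

```python
def launch_overhead_us(kernels_per_layer: int, layers: int, launch_cost_us: float) -> float:
    """Total per-token launch and cleanup overhead, in microseconds."""
    return kernels_per_layer * layers * launch_cost_us


def speed_cap_tokens_per_s(per_token_us: float) -> float:
    """Highest decode speed possible if each token costs at least `per_token_us`."""
    return 1_000_000 / per_token_us


LAYERS = 25
LAUNCH_COST_US = 4.5            # measured on MI300X, per the text
BUDGET_US = 1_000_000 / 3_000   # about 333 microseconds per token at 3,000 tokens/s

for kernels in (10, 5):
    overhead = launch_overhead_us(kernels, LAYERS, LAUNCH_COST_US)
    cap = speed_cap_tokens_per_s(overhead)
    print(f"{kernels:>2} kernels/layer: {overhead:7.1f} us overhead, cap {cap:6.0f} tokens/s")

# Share of the per-token budget consumed by one extra microsecond per layer.
print(f"1 us/layer over {LAYERS} layers = {LAYERS * 1.0 / BUDGET_US:.1%} of the budget")
```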

Turning theoretical HBM bandwidth into useful model bandwidth is thus a matter of systematically identifying and killing the sources of microsecond loss:

Standard inference stacks	Microsecond losses	Kog implementation
Kernel boundaries	Launch, cleanup, cache write-back, scheduler round-trips add overhead and break memory streaming.	Persistent monokernel: one GPU-resident program for the whole decode path.
CPU scheduling and sampling	Host-side logic introduces costly GPU-CPU communication and execution delays.	Full GPU-resident logic including LM-head sampling on the critical execution path. Optional zero-overhead asynchronous CPU logic for output streaming and EOS detection.
Grid synchronization	Matmul, attention, normalization, sampling, and routing all require GPU-wide synchronization and communication, at a cost of several microseconds per operation.	Optimized topology-aware intra-GPU grid sync and AllGather/AllReduce primitives; ~600 ns barrier on MI300X for small payloads.
Inter-GPU collectives	Tensor parallelism inserts two or three AllReduce operations in every layer's critical path.	Optimized KCCL communication primitives with AllReduce latency under 3 µs; Delayed Tensor Parallelism (DTP) communication in Kog's Laneformer model architecture.
Unified memory topology	Unified memory is actually not physically uniform: cache, HBM, IOD chiplets, and XCD placement all affect latency.	Topology-aware memory accesses with IOD-aware buffer placement, polling, and synchronization.
Weight reloads	Imperfect cache management and tile reuse during MatMuls raise β.	Cache- and register-aware kernel with memory layout optimized for low batch sizes.
Non-GEMM work	Computations of softmax, norms, routing, sampling, etc. pause memory streaming.	Monokernel with fused prefetch overlapping across computational sections.

In a nutshell, standard inference stacks waste microseconds everywhere. That is where the available HBM bandwidth disappears.

The Kog stack co-designs the engine, the GPU code, and the model architecture to get these microseconds back

Inference systems are layered: a model on top of a runtime on top of GPU kernels. The model architecture constrains the communication schedule and the structure of the computational graph; the runtime controls scheduling and memory streaming; the GPU code decides whether synchronization, cache management, and topology are managed in a way that fits inside the budget. In existing inference engines, these layers are mostly tuned in isolation.

Kog recognizes the inter-dependencies of these three layers to their full extent, and co-designs them for maximum speed in the Kog Inference Engine.

That's why our critical decoding path does not rely on third-party frameworks, libraries, and abstractions (like PyTorch, Triton, CUTLASS, NCCL, ROCm CK, AITER, or RCCL). These are very valuable general-purpose tools, but our speed objective is narrower: batch-size-1 (or low batch size), full-node, low and medium active-parameter counts, with a budget of only a few hundred microseconds per token. The hot path is implemented in low-level, hand-crafted GPU code (CUDA with PTX inline assembly on NVIDIA, HIP with CDNA ISA inline assembly on AMD) and uses our own KCCL communication functions for collectives.

Here is a summary of some of our key innovations:

Monokernel runtime and optimized GPU code. Our token generation runs as one persistent GPU program, instead of a sequence of per-operation kernels. It decodes the whole sequence in one pass without interruption. This monokernel removes all kernel boundaries, eliminates host-side scheduling and CPU-side token sampling from the critical path, and lets us control synchronization, communication, prefetch, execution order, and scheduling much more tightly than a conventional multi-kernel runtime. This approach is much harder to implement than ordinary fusion, because the same static GPU program must cover MatMul, attention, normalization, routing, sampling, and communication, each of which has different computational shapes, buffer and register allocation needs, and synchronization requirements. However, once implemented successfully, useful memory streaming is no longer repeatedly stopped by kernel boundaries. See the Kog monokernel blog post for a deep dive into our low-level GPU engineering optimizations.
KCCL inter-GPU communications. A single request can use a full 8-GPU node only if the model is parallelized across GPUs. Standard tensor parallelism requires two (or three for a MoE model) all-reduce synchronizations in every layer; with enough layers, it's critical to avoid spending too many microseconds per collective. KCCL is Kog's custom collective-communication layer for this regime. Its objective is not peak aggregate bandwidth but predictable microsecond-scale latency that can be integrated into the monokernel schedule, keeping it under 3 µs where vendor libraries would spend ~8 µs. The code itself is tuned at the assembly level for each target GPU architecture, communication link characteristics, and buffer size.
Laneformer model architecture. Kog's Laneformer model architecture is designed around how multi-GPU nodes actually move data. Its core innovation, Delayed Tensor Parallelism (DTP, explained in detail in our blog post), changes the dependency structure of tensor-parallel decoding so that cross-device communication can be overlapped with useful computation rather than blocking the critical path. The model architecture itself is shaped by the latency structure of multi-GPU decoding.

Our chiplet-topology work on the AMD MI300X GPU is worth discussing, as an example of our hardware-aware software design approach:

The problem: this GPU contains 8 XCDs (compute dies) placed on top of 4 I/O dies (IOD), and 8 HBM stacks behind a unified memory abstraction. 2 XCDs and 2 HBM stacks are attached to each IOD; communication with modules attached to a different IOD is subject to additional latency and a bandwidth bottleneck. The unified memory abstraction is useful for programming simplicity, but it hides non-uniform access paths: in our measurements, the physical route from an XCD to an HBM location changes latency enough to matter (up to 150 ns) and introduces skew between the compute units.
Our solution: for grid synchronization and intra-GPU collectives, we measured barrier latency per XCD and mapped it to the chiplet topology, recovered the physical-memory-address-to-IOD mapping, and used this knowledge to replicate memory buffers on HBM stacks at controlled locations so each XCD polls memory from an HBM stack attached to its own I/O die. The resulting barrier is about 600 ns and stable across compute units.
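The placement idea can be illustrated with a toy model. This is only a sketch of the bookkeeping, not GPU code and not Kog's implementation: the XCD-to-IOD mapping below (XCDs 2k and 2k+1 on IOD k) is an assumption made for illustration, whereas the text describes recovering the real mapping by measurement.

```python
NUM_XCDS = 8
NUM_IODS = 4

# ASSUMPTION for illustration only: XCDs 2k and 2k+1 sit on IOD k.
XCD_TO_IOD = {xcd: xcd // 2 for xcd in range(NUM_XCDS)}


def remote_polls(buffer_iod_for_xcd: dict) -> int:
    """Count XCDs whose polled buffer lives on a different IOD than their own."""
    return sum(
        1 for xcd, buffer_iod in buffer_iod_for_xcd.items()
        if XCD_TO_IOD[xcd] != buffer_iod
    )


# One shared flag placed on IOD 0: most XCDs must cross to another IOD to poll it.
single_buffer = {xcd: 0 for xcd in range(NUM_XCDS)}

# One replica per IOD: every XCD polls the copy attached to its own IOD.
replicated = dict(XCD_TO_IOD)

print("single buffer, remote polls:", remote_polls(single_buffer))
print("replicated,    remote polls:", remote_polls(replicated))
```

The trade-off in this model is that the signalling side writes one copy per IOD so that every polling side stays local.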

The same engineering approach applies on NVIDIA Hopper: at this speed, each GPU package is a specific physical system, not an abstract accelerator.

By digging into the low-level hardware machinery, and adjusting our inference engine to it, we can find spare microseconds that are impossible to reach when using higher-level languages, libraries and frameworks.

What we're launching today

Kog Inference Engine fixes the GPU inference stack to generate tokens on standard GPUs at speeds comparable to dedicated inference hardware

We open a tech preview of Kog Inference Engine's 3,000 tokens/s/req speed in a live playground running the Laneformer 2B model used in the above benchmark, with the same configuration on a single 8× MI300X node, at batch size 1.

Note that this preview is meant to make the speed observable, not to provide a frontier coding assistant. Our model scores 50% on the HumanEval coding benchmark, which is quite good for its size (Qwen2.5-Coder is at 43.9% for the 1.5B version and 52.4% for 3B), and shines when fine-tuned on specific SWE tasks. It uses vanilla autoregressive decoding with a 4,096-token sequence length (a long-context extension to 128k is under way). We pre-trained it on 6T tokens from the NVIDIA Nemotron v1 and v2 datasets, on a cluster of 256 H100 GPUs.

Importantly, we used no optimization tricks other than the ones explained above: no quantization, no speculative decoding, no pruning, no early exit, no KV cache compression, etc. We do plan to implement these low-hanging-fruit optimizations in our future roadmap, along with others, to facilitate support for larger models and batching at similar speeds (or simply to increase the speed).

On a single 8× NVIDIA H200 node, our engine currently generates 2,100 tokens/s per request. We expect to match the speed we reach on AMD GPUs in the near future.

Now, let's find out how we will scale this tech preview to accelerate the latest frontier AI models.

Scaling to large third-party MoE models

The next engineering step is to apply the same stack to larger third-party open-weight models (dense and MoE) with FP8/FP4 quantization and multi-token prediction techniques (like speculative decoding) when applicable.

Our scaling argument is built on active-parameter bytes moved in each forward pass, not total parameter count. For dense models, active parameters are essentially the full model. For MoEs, what matters is active parameters per generated token, which can be dramatically smaller than the total (numbers below are at batch size 1):

Qwen3-Coder-Next: 80B total, 3B active.
GPT-OSS-120B: 117B total, 5.1B active.
DeepSeek-V4-Flash: 284B total, 13B active.
Kimi-K2.6: 1.04T total, 32B active.
Qwen3-Coder-480B-A35B: 480B total, 35B active.
DeepSeek-V4-Pro: 1.6T total, 49B active.

The first-order bandwidth-only ceiling looks like this on a single 8-GPU node at 80% of theoretical aggregate bandwidth (numbers provided are output tokens per second):

Model (active params, precision)	8× H200 ~30.7 TB/s	8× MI300X ~33.6 TB/s	8× B200 / MI355X ~51.2 TB/s	8× MI450 ~125.4 TB/s	8× Rubin ~140.8 TB/s
Qwen3-Coder-Next (3B, FP8)	~10,200	~11,200	~17,100	~41,800	~46,900
GPT-OSS-120B (5.1B, MXFP4/BF16)	~6,150	~6,730	~10,300	~25,100	~28,200
DeepSeek-V4-Flash (13B, MXFP4/FP8)	~3,250	~3,560	~5,420	~13,300	~14,900
Kimi-K2.6 (32B, INT4/BF16)	~915	~1,000	~1,520	~3,730	~4,190
Qwen3-Coder-480B-A35B (35B, FP8)	~880	~960	~1,460	~3,580	~4,020
DeepSeek-V4-Pro (49B, MXFP4/FP8)	~860	~940	~1,430	~3,500	~3,940

These are upper bounds, not guaranteed achievable production speeds per se. As discussed in the previous sections, real speeds need to take into account per-layer slowdowns due to kernel launches, KV-cache traffic, β, non-GEMM work, routing, synchronizations, inter-GPU collectives, etc. This is where Kog's inference engine shines compared to traditional inference stacks.
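The FP8 rows of the table are the easiest to reproduce, since each active parameter then occupies one byte. The sketch below assumes exactly that (all active weights in FP8, no KV-cache traffic, β = 1); the mixed-precision rows depend on a per-layer byte breakdown that is not given here, so they are left out. Names are illustrative.

```python
# Effective aggregate bandwidth per 8-GPU node in TB/s, from the table header.
NODE_BANDWIDTH_TB_S = {
    "8x H200": 30.7,
    "8x MI300X": 33.6,
    "8x B200 / MI355X": 51.2,
    "8x MI450": 125.4,
    "8x Rubin": 140.8,
}

# ASSUMPTION: FP8 means 1 byte per active parameter, so 3B active = 3 GB per token.
FP8_ACTIVE_GB = {"Qwen3-Coder-Next": 3.0, "Qwen3-Coder-480B-A35B": 35.0}


def bandwidth_ceiling(bandwidth_tb_s: float, active_gb: float) -> float:
    """Bandwidth-only upper bound in output tokens per second."""
    return bandwidth_tb_s * 1000.0 / active_gb


for model, active_gb in FP8_ACTIVE_GB.items():
    row = [f"{bandwidth_ceiling(bw, active_gb):,.0f}" for bw in NODE_BANDWIDTH_TB_S.values()]
    print(f"{model:<24}", "  ".join(row))
```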

There is an elephant in the room, though: for third-party models, we cannot use Delayed Tensor Parallelism, since the model architecture is fixed. We need to use standard Tensor Parallelism, and thus pay a latency cost for inter-GPU all-reduce communications 3 times per layer. Fortunately, the speed of our KCCL collectives, combined with our monokernel design, allows us to continue streaming model weights to compute units and memory caches while GPUs communicate. This does not fully remove the impact of such communications, but it reduces it significantly (remember that we are memory-bound, not compute-bound, so even if compute is paused for a little while, we can catch up very easily; the real limiting factor is the size of the shared memory buffers, register files, and caches into which model weights are fetched).

Now, to predict real numbers, let's rely on the fact that our tech preview achieves ~36% MBU. Assuming conservatively that we do not improve this number (although we strongly believe that we will), and leaving potential quantization or multi-token prediction tricks out of the equation, the numbers in the above table should be divided by ~2.8 to provide a reasonable estimate of the real output speed we should expect on MoE frontier models using our current techniques (in tokens/s):

Model (active params, precision)	8× H200 ~30.7 TB/s	8× MI300X ~33.6 TB/s	8× MI355X / B200-class ~51.2 TB/s	8× MI450 ~125.4 TB/s	8× Rubin ~140.8 TB/s
Qwen3-Coder-Next (3B, FP8)	~3,650	~4,000	~6,100	~14,900	~16,800
GPT-OSS-120B (5.1B, MXFP4/BF16)	~2,200	~2,400	~3,660	~8,970	~10,100
DeepSeek-V4-Flash (13B, MXFP4/FP8)	~1,160	~1,270	~1,940	~4,740	~5,320
Kimi-K2.6 (32B, INT4/BF16)	~325	~355	~545	~1,330	~1,500
Qwen3-Coder-480B-A35B (35B, FP8)	~315	~345	~520	~1,280	~1,440
DeepSeek-V4-Pro (49B, MXFP4/FP8)	~305	~335	~510	~1,250	~1,410

Of course these are ballpark estimates, and real numbers will differ. But the core idea holds.
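For reference, here is a minimal sketch of where the ~36% MBU and the ~2.8 divisor come from, using the preview's own figures (3,000 tokens/s, 4 GB of FP16 weights, 33.6 TB/s on 8× MI300X). The function names are illustrative, and KV-cache traffic is ignored:

```python
def memory_bandwidth_utilization(
    tokens_per_s: float, gb_per_token: float, bandwidth_tb_s: float
) -> float:
    """Fraction of the effective node bandwidth spent streaming weights."""
    return tokens_per_s * gb_per_token / (bandwidth_tb_s * 1000.0)


def estimated_speed(ceiling_tokens_per_s: float, mbu: float) -> float:
    """Scale a bandwidth-only ceiling by an assumed MBU."""
    return ceiling_tokens_per_s * mbu


# Tech preview: 2B parameters in FP16 (4 GB) at 3,000 tokens/s on 8x MI300X.
preview_mbu = memory_bandwidth_utilization(3_000, 4.0, 33.6)
print(f"MBU: {preview_mbu:.1%}, divisor: {1 / preview_mbu:.2f}")

# Example: Qwen3-Coder-Next ceiling on 8x H200 from the first table.
print(f"Estimated speed: {estimated_speed(10_200, preview_mbu):,.0f} tokens/s")
```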

As GPU HBM bandwidth grows and as the Kog stack — runtime, kernel, collectives, etc. — matures, we expect the speed of large frontier MoE models to move into the 1,000–5,000 tokens/s/request band on standard datacenter GPUs.

Conclusion

Dedicated inference hardware established single-request generation speed as a distinct infrastructure category that will become increasingly important with the rise of autonomous AI agents.

Until now, standard datacenter GPUs have not been able to compete in this category; not because of their hardware, but because of how the software inference stack has been built on top of them.

Our public preview demonstrates that a standard 8-GPU node can generate 3,000 output tokens per second per request on a 2B coding model at batch size 1, with no quantization or speculative decoding. We achieved that by treating the persistent runtime, low-level GPU code, and model architecture as one system.

The broader takeaway is that this is not limited to a small custom model. As available HBM bandwidth grows and the Kog stack matures, we expect the same kind of performance to carry over to the large open-weight MoEs at the frontier of AI agents today.

Explore these links to dig deeper

Test our speed in the Kog Playground
Blog post on Delayed Tensor Parallelism from our LLM architecture research team
Technical deep dive from our GPU Engineering team on our monokernel runtime, including some of our low-level GPU optimizations for the AMD MI300X GPU.
Design Partner Program for teams building coding agents, app-generation systems, or other agentic workflows where iteration speed is already a competitive bottleneck: we'll be happy to talk, please contact us through our website.

About Kog

Kog is a Paris-based AI infrastructure startup building a real-time inference engine for AI agents with innovative low-level GPU Engineering and LLM architecture research. Founded in 2023 by Gaël Delalleau — an École Polytechnique engineer whose career spans cybersecurity research and high-performance GPU work — Kog operates from Paris with a team of 11, including 10 engineers and researchers (5 PhDs).

Kog has raised $5M from Varsity VC and BPI France's Deep Tech Program, and was awarded the French Tech 2030 label in October 2025, a French government recognition granted to select national deep-tech companies contributing to strategic sectors.
