But I have to say that the comparison is not really fair. The comparison is done with a 2B model vs frontier models that are likely 100s of times larger. Also, Taalas with their 15000 tok/s inference are suspiciously missing from the comparison.
We need to see the comparison with this framework and useful models, which at present seems to mean ~30 B.
We strove to be as fair as possible in the benchmark, but it's indeed not perfect. Taalas should have been added in the dedicated hardware section, even though they use 3-bit quantization while we are on FP16 (to be fair in both directions) and they burn the model directly onto the card.
Our tech preview is about the speed (hence the small dense model, it was easier to implement).
The math checks out, though, for supporting large frontier MoE models at similar speeds:
- At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).
- DeepSeek V4 Flash has 13B in mixed FP4/FP8, so let's say ballpark around 3x bigger than 4 GB. In theory we could reach >1,000 tok/s on it with MI300X/H200 and up to 4k on next generation GPUs.
Check out the math at the end of our blog post:
https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...
The demo is very impressive!
disclaimer: I've known the founder for a while, as legitimate as it gets in deep tech, real years of research and engineering behind this, not vaporware
I have been lamenting for a while that the memory-bandwidth <-> tps relationship was pretty much working for small models on consumer cards, but not at all on datacenter hardware.
It's great to see that with proper care on the inference engine implementation the relationship can be restored.
Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...
Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...
To try the speed on the playground: http://playground.kog.ai
For new open weights models, will you need to adapt model code and optimization for your inference engine by hand?
It's true that BS=1 is king when it comes to agentic workflows; however, these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well?
Any plans to release it open source?
Congratz again for the release
I guess with 1B or 500M model inference would be even faster?
each time getting 3300+ tps.
That means Jensen can add another 30 times faster when comparing Rubin to Blackwell without having to actually do anything.
Hopefully that means he won't have any problem to make another 150 billion in profit in the next year.
Sorry for the sarcasm. Looks like interesting work.
Feels like a preview of the future
For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.
I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, but this article is misleading.
I can get a 27b model to kick out 40 tok/s on 16 gb vram. This is the area ripe for development.
If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.
I haven't read the article yet (I will try to), but I have a question: can this approach work for, say, trillion-parameter or other very large models as well, or is there some wall that makes it valuable only for smaller models?
That being said, it's still really incredible: these small models are getting really good for many use cases, and speed is becoming their bottleneck. With greater speeds on consumer hardware, I think it's gonna be amazing work!
they seem to think it scales up because they're shortening the stack.
Our view is that MTP / speculative decoding could help us get an X multiplier (X = 2 to 6) on the tokens-per-second speed we currently achieve.
We are a bit greedy, we want to stack optimizations on top of each other to get the maximum speed possible.
It involves additional compute to verify the predicted tokens during the forward pass (it's like a small batch), which should be totally doable for dense models, and will be more tricky for MoEs because it could mean activating more experts and thus more active parameters.
I feel like if they got DeepSeek V4 Flash and Pro running on their hardware, even if at less than 1000 tok/s, they’d still be crushing it with any subscription they’d provide, given how generous their token limits were.
About model performance, we plan to support the latest frontier models (this tech preview is about the speed of the engine)
The math checks out though to allow support for large frontier MoE models at similar speeds.
At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).
DeepSeek V4 Flash has 13B in mixed FP4/FP8.
Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...
But joking aside, I think we don't even know yet what is possible once you hit very high tokens/second numbers, provided the whole ecosystem behind it can handle it.
You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.
You could build and architect a whole stack in parallel.
You could do massive thinking token / chain of thought.
You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.
We could start doing some type of monte-carlo search with this.
To answer your questions:
- yes, we rewrite the whole model code (while keeping the same logic) in CUDA/HIP and assembly, in order to optimize by hand for each GPU type. It's quite tedious for sure, but I guess this is the price to pay to get this kind of results.
- the batching question is a great one. In agentic systems, there is probably a trade-off between sequential thinking/iterations vs parallel exploration of multiple solutions. Also, there could just be multiple independent tasks running in parallel, depending on the use case.
We plan to support a small amount of batching, but it quickly becomes a trade-off vs speed. Pick one for your use case, I guess.
Also to consider: because we answer requests much faster, we are also able to process lots of them without needing high batches - and scaling on multiple nodes is possible.
- open sourcing: maybe, maybe not. I'm still undecided on this. We are a small startup and I'm told that giving our IP away might be shooting ourselves in the foot. On the other hand, I think it could be of great benefit to the community and for us... we'll see
The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.
Also worth noting that our results are currently for standard datacenter GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.
Our process has been, and will continue to be, a sequence of (tedious) R&D experiments where the GPU never behaves as expected when pushed to its limits in ways no-one really tested before (I still have nightmares of the L3 cache cross-IOD bottlenecks on MI300X).
IMHO, we did solve the multi-GPU memory bandwidth scaling problem, and thus the linear scaling of the size of the model towards infinity. But the main difficulties will come from keeping the speed, with steady and continuous memory streaming, while implementing the much more complex architecture of modern frontier MoEs (attention compression tricks, hash layers, routing logic, etc.)
The authors' approach also encompasses multi-node approaches that won't apply easily to consumer inference since consumer GPUs have very low-performance interconnects, hence why layer parallelism is usually favored. (But that doesn't work very well with the monokernel approach, since it involves running distinct logic on each separate GPU. It also doesn't speed up single inference, though you can get that throughput back by pipelining small minibatches.)
Sorry for the confusion
DTP is something we built for our roadmap in order to get to extremely high speeds (like 10k+ tokens/s). When the budget is under 10 µs per layer, any little overhead matters.
For 1k to 5k tokens/s, regular TP still works because we are able to optimize the inter-GPU all-reduce collectives to under 3 µs, which allows us to keep streaming model weights into shared memory, registers and caches while GPUs exchange data.
In contrast, not enterprise GPUs that cost as much as a car.
Did the article headline not say Standard GPU?
Edit: I just tried a 4B model on a RTX Pro 6000, getting ~500 tok/s with llama.cpp not even trying to optimize or change anything, just default settings. I'm sure with vLLM it'd be a lot faster already, still before manually tuning configs. I wouldn't call that card "Standard GPU" either FWIW, but it makes the claimed performance numbers feel not as exciting, especially given the hardware they were using.
That's basically antirez's DS4 and it works pretty well because there are few leading models and few hardware platforms (Apple, GB10, Strix Halo) that are worth using.
- model size: 2B is just for this preview (it was faster to implement), our article explains how we expect to support large frontier MoE at 1,000 to 5,000 tokens/s
- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.
The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).
All our work at Kog is about removing these bottlenecks.
Inference
Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). This preview runs a 2B model, with support for large third-party MoE models coming next at similar speeds.

(see below for full benchmark details)
TL;DR: we show that AI inference on GPUs can be super-fast, reaching the speed regime of dedicated inference hardware cards when optimizing the whole software stack with architecture/engine/kernel co-design. Test the speed in our live coding playground: playground.kog.ai.
This post explains why optimizing for single-request LLM decoding speed is important for AI agents; why it's primarily a memory-bandwidth maximization problem, not a FLOPS one; why standard datacenter GPU hardware has a much higher decoding-speed ceiling than current inference stacks expose due to software bottlenecks; and how that ceiling can be reached (even on large MoE models) by co-designing the model architecture, runtime, and low-level GPU code as a single latency-optimized pipeline.
Our public tech preview is about proving that extremely fast single-request decoding is possible on the standard datacenter GPUs enterprises already own — including AI labs and sovereign-AI buyers. The limiting factor has been that existing inference software stacks are not optimized for this type of workload. Opening the GPU path could deliver that speed without the lock-in of proprietary silicon.
You can test the speed of our 2B coding model today. It's small and not a frontier model (we've been focused on speed rather than scale), though still quite capable when fine-tuned for specific software engineering tasks.
Inference benchmarks typically conflate three quantities. Aggregate throughput (total tokens generated per second across all users) measures server utilization and rewards large batches. Time to first token measures prefill latency. Decode speed per request measures token generation speed and defines how long one user waits before receiving the full response. That last one governs every long serial interaction, and it's what AI agents are bottlenecked on.
Agentic software engineering is a sequential loop: inspect, plan, edit, test, revise. Each step depends on the previous one. Tool time sometimes dominates, as tests have to run and web pages have to load, but the generation-heavy steps (planning, code writing, trace analysis, debugging, refactoring) set the loop rate. And reasoning tokens compound on top.
The numbers translate directly into product and user experience. If an agent needs to generate 50,000 tokens in a workflow, 100 tokens/s is roughly eight minutes; 3,000 tokens/s is under twenty seconds. The difference changes the product that can be built.
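The arithmetic behind those two figures can be checked with a minimal sketch. The function name is illustrative, and the calculation assumes the whole workflow is pure decoding (no tool time, no prefill):

```python
def generation_time_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time spent purely on decoding, ignoring tool time and prefill."""
    return total_tokens / tokens_per_second


workflow_tokens = 50_000  # workflow size quoted in the text
for speed in (100, 3_000):
    seconds = generation_time_seconds(workflow_tokens, speed)
    print(f"{speed:>5} tokens/s -> {seconds:7.1f} s ({seconds / 60:.1f} min)")
```

At 100 tokens/s this gives 500 s (a bit over eight minutes); at 3,000 tokens/s it gives about 16.7 s.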
As agents become more autonomous, the productivity frontier shifts from intelligence alone to intelligence × iteration speed. The best agents will generate more useful tokens, reason more, and perform more tool calls, tests, and revisions inside the same wall-clock budget.
This is why Kog optimizes single-request latency first, and why this preview runs at batch size 1. Large batches do matter and we will support them in production, but they answer a different question.
But what is limiting decode speed on GPUs?
At batch size 1, autoregressive decoding is dominated by matrix-vector work. For each generated token, all the active weights of the model must move through the memory hierarchy inside the GPU, from HBM to compute processors. Thus, a first-order bound is:
tokens/s ≤ effective_memory_bandwidth / (β × active_weight_bytes + KV_cache_bytes)
where β can be greater than one when tiles are reloaded or cache reuse is imperfect.
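To make the bound concrete, here is a minimal sketch of it as a function. The function name, parameter names, and the sample inputs (10 TB/s, β = 1.2, 0.2 GB of KV cache reads) are hypothetical values chosen only to show how sensitive the bound is to β and KV traffic; they are not measurements from the engine.

```python
def decode_speed_bound(effective_bandwidth_bytes_per_s: float,
                       active_weight_bytes: float,
                       beta: float = 1.0,
                       kv_cache_bytes: float = 0.0) -> float:
    """First-order upper bound on tokens/s for batch-size-1 decoding."""
    if beta < 1.0:
        raise ValueError("beta models weight reloads, so it cannot be below 1")
    bytes_per_token = beta * active_weight_bytes + kv_cache_bytes
    return effective_bandwidth_bytes_per_s / bytes_per_token


# Hypothetical inputs, for illustration only.
bandwidth = 10e12   # 10 TB/s of effective bandwidth
weights = 4e9       # 4 GB of active weights

ideal = decode_speed_bound(bandwidth, weights)
degraded = decode_speed_bound(bandwidth, weights, beta=1.2, kv_cache_bytes=0.2e9)
print(f"ideal: {ideal:.0f} tokens/s, with reloads and KV traffic: {degraded:.0f} tokens/s")
```

With these made-up inputs, a 20% reload factor plus a small KV read is enough to drop the bound from 2,500 to 2,000 tokens/s.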
The key fact is that low-batch decode has very low arithmetic intensity. In FP16, a model weight occupies two bytes and contributes roughly one multiply-add (two FLOPs) which is about 1 FLOP/byte. FP8 raises it to ~2 FLOPs/byte; FP4 to ~4. However, modern AI GPUs expose hundreds of peak FLOPs per byte of HBM bandwidth. NVIDIA's H200, for example, claims a peak balance of roughly 400 FLOPs/byte. Thus, token generation speed is capped by memory bandwidth before being limited by FLOPS.
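The gap between the two numbers can be sketched with a simple roofline-style ratio. The function name is illustrative, and the 400 FLOPs/byte balance is the H200 figure quoted above, used here as an assumption rather than a measured value:

```python
def decode_arithmetic_intensity(bytes_per_weight: float) -> float:
    """FLOPs per byte at batch size 1: one multiply-add (2 FLOPs) per weight read."""
    flops_per_weight = 2.0
    return flops_per_weight / bytes_per_weight


gpu_balance = 400.0  # peak FLOPs per byte of HBM bandwidth (figure quoted in the text)
for name, size in (("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)):
    intensity = decode_arithmetic_intensity(size)
    usable = intensity / gpu_balance  # fraction of peak compute a memory-bound kernel can use
    print(f"{name}: {intensity:.0f} FLOP/byte, {usable:.2%} of peak compute")
```

Even in FP4, a memory-bound decode keeps only about 1% of the peak compute busy under this model, which is why bandwidth is the quantity to optimize.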
This is why Memory Bandwidth Utilization (MBU) is the central metric for single-request speed, not Model FLOP Utilization (MFU). MFU can still be improved by batching several requests together, which can however increase the latency experienced by each user as more KV cache data needs to be streamed inside the GPU.
For batch-size-1 decode, more memory bandwidth equals more tokens generated per second. The good news is that memory bandwidth of GPUs is already very high. An 8× NVIDIA H200 node exposes roughly 30.7 TB/s of effective aggregate memory bandwidth (taking 80% of the 4.8 TB/s theoretical per GPU as a realistic ceiling). An 8× AMD MI300X node reaches about 33.6 TB/s in practice (assuming 4.2 TB/s achievable per GPU).
Let's take a 2B-parameter dense model in FP16 as an example. It has roughly 4 GB of active weights, so if weights alone could be streamed perfectly (ignoring KV cache traffic and potential β reloads), the speed-of-light upper bounds would be:
Let's consider a few more examples: at batch size 1, the same speed results apply to a MoE with 4B active parameters in FP8; and a 32B-active-parameter MoE in FP4 would be bounded at ~2,000 tokens/s.
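The weights-only bounds for these examples can be reproduced with a short sketch. It uses only the bandwidth figures given above; the dictionary and function names are illustrative, and β = 1 with zero KV cache traffic is assumed throughout:

```python
TB = 1e12

# Effective aggregate bandwidth per 8-GPU node, from the figures in the text.
NODES = {
    "8x H200": 8 * 4.8 * TB * 0.80,  # 80% of 4.8 TB/s theoretical per GPU
    "8x MI300X": 8 * 4.2 * TB,       # 4.2 TB/s assumed achievable per GPU
}

# (active parameters, bytes per parameter)
MODELS = {
    "2B dense, FP16": (2e9, 2.0),
    "4B active MoE, FP8": (4e9, 1.0),
    "32B active MoE, FP4": (32e9, 0.5),
}


def speed_of_light(node_bandwidth: float, active_params: float, bytes_per_param: float) -> float:
    """Weights-only bound on tokens/s: beta = 1 and no KV cache traffic."""
    return node_bandwidth / (active_params * bytes_per_param)


for model, (params, width) in MODELS.items():
    for node, bandwidth in NODES.items():
        print(f"{model:<20} {node:<10} ~{speed_of_light(bandwidth, params, width):,.0f} tokens/s")
```

Under these assumptions, the 4 GB cases come out at about 7,680 tokens/s on the H200 node and 8,400 tokens/s on the MI300X node, and the 32B FP4 case lands at roughly 1,900 to 2,100 tokens/s.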
In a latency-first inference stack, a valid strategy is thus to parallelize inference on a full server node providing eight GPUs worth of HBM bandwidth.
It should also be noted that the next GPU generations (Rubin and MI450), coming in H2 2026, will provide about 4x higher memory bandwidth, making it possible to reach the same speed for 4x bigger models, or with 4x fewer GPUs (potentially one or two instead of a full node). This will also help support bigger batch sizes at the same speed. At the end of this post, we'll dig a bit deeper into this topic to show that a decoding speed of thousands of tokens per second should be achievable on datacenter GPUs for current large state-of-the-art MoE models.
There is a catch, though. These bounds do not take into account stalls from non-GEMM operations, intra-GPU synchronization, inter-GPU communication, instruction overhead, and so on. The key question is how continuously the system can stream the active model parameters through HBM and cache without interruptions. It turns out that making an 8-GPU server behave like a single continuous memory-streaming machine is, indeed, a hard problem.
At 3,000 tokens/s, the per-token budget is roughly 333 microseconds, including all layers, LM head and sampling. On a 25-layer model, spending just an extra 1 microsecond per layer consumes 7.5% of the time budget!
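As a quick check of that budget arithmetic, here is a minimal sketch (function names are illustrative):

```python
def per_token_budget_us(tokens_per_second: float) -> float:
    """Time available to produce one token, in microseconds."""
    return 1e6 / tokens_per_second


def budget_share(extra_us_per_layer: float, num_layers: int, tokens_per_second: float) -> float:
    """Fraction of the per-token budget consumed by a fixed per-layer overhead."""
    return extra_us_per_layer * num_layers / per_token_budget_us(tokens_per_second)


print(f"budget: {per_token_budget_us(3_000):.0f} us per token")
print(f"1 us per layer over 25 layers: {budget_share(1.0, 25, 3_000):.1%} of the budget")
```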
The usual abstraction stack — model graph logic written in a high-level language or framework like PyTorch or Triton, lowered into many kernels, scheduled by a CPU runtime, synchronized at kernel boundaries, and mediated by framework-level communication libraries — is flexible, facilitates maintainability and integration, and is great for general-purpose serving, including maximizing aggregated throughput at high batch sizes. This is the approach usually taken for models running on inference engines like vLLM, SGLang, and TensorRT-LLM. It is, however, poorly matched to a 333-microsecond token budget.
A simple launch-overhead calculation shows the problem. If a kernel launch and cleanup costs about 4.5 µs (as per our measurements on AMD MI300X), ten kernels per Transformer layer over twenty-five layers create 1,125 µs of overhead per token before any useful work, thus capping the achievable speed at ~890 tokens/s. Even just five aggressively fused kernels per layer still produce ~563 µs of overhead, capping speed around 1,780 tokens/s. And this is before taking into account the other sources of overhead, which compound on top of this.
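The same calculation as a small sketch, using the 4.5 µs launch cost and 25 layers quoted above. The function name is illustrative, and the model assumes launches are strictly serialized with no overlap:

```python
def launch_overhead_cap(launch_cost_us: float, kernels_per_layer: int, num_layers: int):
    """Return (overhead per token in microseconds, resulting cap in tokens/s)."""
    overhead_us = launch_cost_us * kernels_per_layer * num_layers
    return overhead_us, 1e6 / overhead_us


for kernels in (10, 5):
    overhead, cap = launch_overhead_cap(4.5, kernels, 25)
    print(f"{kernels:>2} kernels/layer: {overhead:7.1f} us overhead -> cap ~{cap:,.0f} tokens/s")
```

Ten kernels per layer give 1,125 µs of overhead (a cap near 889 tokens/s); five give 562.5 µs (a cap near 1,778 tokens/s), matching the rounded figures in the paragraph above.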
Turning theoretical HBM bandwidth into useful model bandwidth is thus a matter of systematically identifying and killing the sources of microsecond loss:
| Standard inference stacks | Microsecond losses | Kog implementation |
|---|---|---|
| Kernel boundaries | Launch, cleanup, cache write-back, scheduler round-trips add overhead and break memory streaming. | Persistent monokernel: one GPU-resident program for the whole decode path. |
| CPU scheduling and sampling | Host-side logic introduces costly GPU-CPU communication and execution delays. | Full GPU-resident logic including LM-head sampling on the critical execution path. Optional zero-overhead asynchronous CPU logic for output streaming and EOS detection. |
| Grid synchronization | Matmul, attention, normalization, sampling, and routing all require GPU-wide synchronization and communication, at a cost of several microseconds per operation. | Optimized topology-aware intra-GPU grid sync and AllGather/AllReduce primitives; ~600 ns barrier on MI300X for small payloads. |
| Inter-GPU collectives | Tensor parallelism inserts two or three AllReduce operations in every layer's critical path. | Optimized KCCL communication primitives with AllReduce latency under 3 µs; Delayed Tensor Parallelism (DTP) communication in Kog's Laneformer model architecture. |
| Unified memory topology | Unified memory is actually not physically uniform: cache, HBM, IOD chiplets, and XCD placement all affect latency. | Topology-aware memory accesses with IOD-aware buffer placement, polling, and synchronization. |
| Weight reloads | Imperfect cache management and tile reuse during MatMuls raise β. | Cache- and register-aware kernel with memory layout optimized for low batch sizes. |
| Non-GEMM work | Computations of softmax, norms, routing, sampling, etc. pause memory streaming. | Monokernel with fused prefetch overlapping across computational sections. |
In a nutshell, standard inference stacks waste microseconds everywhere. That is where the available HBM bandwidth disappears.
Inference systems are layered: a model on top of a runtime on top of GPU kernels. The model architecture constrains the communication schedule and the structure of the computational graph; the runtime controls scheduling and memory streaming; the GPU code decides whether synchronization, cache management, and topology are managed in a way that fits inside the budget. In existing inference engines, these layers are mostly tuned in isolation.
Kog recognizes the inter-dependencies of these three layers to their full extent, and co-designs them for maximum speed in the Kog Inference Engine.
That's why our critical decoding path does not rely on third-party frameworks, libraries, and abstractions (like PyTorch, Triton, CUTLASS, NCCL, ROCm CK, AITER, or RCCL). These are very valuable general-purpose tools, but our speed objective is narrower: batch-size-1 (or low batch size), full-node, low and medium active-parameter counts, with a budget of only a few hundred microseconds per token. The hot path is implemented in low-level, hand-crafted GPU code (CUDA with PTX inline assembly on NVIDIA, HIP with CDNA ISA inline assembly on AMD) and uses our own KCCL communication functions for collectives.
Here is a summary of some of our key innovations:
Our chiplet-topology work on the AMD MI300X GPU is worth discussing, as an example of our hardware-aware software design approach:
The same engineering approach applies on NVIDIA Hopper: at this speed, each GPU package is a specific physical system, not an abstract accelerator.
By digging into the low-level hardware machinery, and adjusting our inference engine to it, we can find spare microseconds that are impossible to reach when using higher-level languages, libraries and frameworks.

Kog Inference Engine fixes the GPU inference stack to generate tokens on standard GPUs at speeds comparable to dedicated inference hardware
We open a tech preview of Kog Inference Engine's 3,000 tokens/s/req speed in a live playground running the Laneformer 2B model used in the above benchmark, with the same configuration on a single 8× MI300X node, at batch size 1.
Note that this preview is meant to make the speed observable, not to provide a frontier coding assistant. Our model scores 50% on the HumanEval coding benchmark, which is actually quite good for its size (Qwen2.5-Coder is at 43.9% for the 1.5B version and 52.4% for 3B), and shines when fine-tuned on specific SWE tasks. It uses vanilla autoregressive decoding with a 4,096-token sequence length (long-context extension to 128k is under way). We pre-trained it on 6T tokens from the NVIDIA Nemotron v1 and v2 datasets, on a cluster of 256 H100 GPUs.
Importantly, we did not use any optimization tricks other than the ones explained above: no quantization, no speculative decoding, no pruning, no early exit, no KV cache compression, etc. We do plan to implement these kinds of low-hanging-fruit optimizations in our future roadmap, along with others, to facilitate support for larger models and batching at similar speeds (or just to increase the speed).
On a single 8× NVIDIA H200 node, our engine currently generates 2,100 tokens/s per request. We expect to match the speed we reach on AMD GPUs in the near future.
Now, let's find out how we will scale this tech preview to accelerate the latest frontier AI models.
The next engineering step is to apply the same stack to larger third-party open-weight models (dense and MoE) with FP8/FP4 quantization and multi-token prediction techniques (like speculative decoding) when applicable.
Our scaling argument is built on active-parameter bytes moved in each forward pass, not total parameter count. For dense models, active parameters are essentially the full model. For MoEs, what matters is active parameters per generated token, which can be dramatically smaller than the total (numbers below are at batch size 1):
The first-order bandwidth-only ceiling looks like this on a single 8-GPU node at 80% of theoretical aggregate bandwidth (numbers provided are output tokens per second):
| Model (active params, precision) | 8× H200 ~30.7 TB/s | 8× MI300X ~33.6 TB/s | 8× B200 / MI355X ~51.2 TB/s | 8× MI450 ~125.4 TB/s | 8× Rubin ~140.8 TB/s |
|---|---|---|---|---|---|
| Qwen3-Coder-Next (3B, FP8) | ~10,200 | ~11,200 | ~17,100 | ~41,800 | ~46,900 |
| GPT-OSS-120B (5.1B, MXFP4/BF16) | ~6,150 | ~6,730 | ~10,300 | ~25,100 | ~28,200 |
| DeepSeek-V4-Flash (13B, MXFP4/FP8) | ~3,250 | ~3,560 | ~5,420 | ~13,300 | ~14,900 |
| Kimi-K2.6 (32B, INT4/BF16) | ~915 | ~1,000 | ~1,520 | ~3,730 | ~4,190 |
| Qwen3-Coder-480B-A35B (35B, FP8) | ~880 | ~960 | ~1,460 | ~3,580 | ~4,020 |
| DeepSeek-V4-Pro (49B, MXFP4/FP8) | ~860 | ~940 | ~1,430 | ~3,500 | ~3,940 |
These are upper bounds, not guaranteed achievable production speeds per se. As discussed in the previous sections, real speeds need to take into account per-layer slowdowns due to kernel launches, KV-cache traffic, β, non-GEMM work, routing, synchronizations, inter-GPU collectives, etc. This is where Kog's inference engine shines compared to traditional inference stacks.
There is an elephant in the room, though: for third-party models, we cannot use Delayed Tensor Parallelism, since the model architecture is fixed. We need to use standard Tensor Parallelism, and thus pay a latency cost for inter-GPU all-reduce communications 3 times per layer. Fortunately, the speed of our KCCL collectives, combined with our monokernel design, allows us to continue streaming model weights to compute units and memory caches while GPUs communicate. This does not fully remove the impact of such communications, but it reduces it significantly. Remember that we are memory-bound, not compute-bound: even if compute is paused for a little while, we can catch up very easily. The real limiting factor is the size of the shared memory buffers, register files, and caches into which model weights are prefetched.
Now, to predict real numbers, let's rely on the fact that our tech preview achieves ~36% MBU. Assuming conservatively we would not improve this number (although we strongly believe that we will), and leaving potential quantization or multi-token prediction tricks out of the equation, it means that the numbers in the above table should be divided by ~2.8 to provide a reasonable estimate of the real output speed we should expect on MoE frontier models by using our current techniques (in tokens/s):
| Model (active params, precision) | 8× H200 ~30.7 TB/s | 8× MI300X ~33.6 TB/s | 8× MI355X / B200-class ~51.2 TB/s | 8× MI450 ~125.4 TB/s | 8× Rubin ~140.8 TB/s |
|---|---|---|---|---|---|
| Qwen3-Coder-Next (3B, FP8) | ~3,650 | ~4,000 | ~6,100 | ~14,900 | ~16,800 |
| GPT-OSS-120B (5.1B, MXFP4/BF16) | ~2,200 | ~2,400 | ~3,660 | ~8,970 | ~10,100 |
| DeepSeek-V4-Flash (13B, MXFP4/FP8) | ~1,160 | ~1,270 | ~1,940 | ~4,740 | ~5,320 |
| Kimi-K2.6 (32B, INT4/BF16) | ~325 | ~355 | ~545 | ~1,330 | ~1,500 |
| Qwen3-Coder-480B-A35B (35B, FP8) | ~315 | ~345 | ~520 | ~1,280 | ~1,440 |
| DeepSeek-V4-Pro (49B, MXFP4/FP8) | ~305 | ~335 | ~510 | ~1,250 | ~1,410 |
Of course these are ballpark estimates, and real numbers will differ. But the core idea holds.
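The scaling step from ceiling to estimate can be sketched as follows. This assumes MBU is computed against the ~33.6 TB/s effective MI300X node bandwidth with 4 GB of weights read per token (the text does not spell out the exact definition), and the function names are illustrative:

```python
def memory_bandwidth_utilization(tokens_per_second: float,
                                 bytes_per_token: float,
                                 node_bandwidth: float) -> float:
    """Share of the node's effective bandwidth spent streaming useful weight bytes."""
    return tokens_per_second * bytes_per_token / node_bandwidth


def estimated_speed(bandwidth_ceiling_tokens_per_s: float, mbu: float) -> float:
    """Scale a bandwidth-only ceiling by the MBU observed on the tech preview."""
    return bandwidth_ceiling_tokens_per_s * mbu


# Tech preview: 3,000 tokens/s, 2B params in FP16 (4 GB), 8x MI300X at ~33.6 TB/s.
preview_mbu = memory_bandwidth_utilization(3_000, 4e9, 33.6e12)
print(f"preview MBU: {preview_mbu:.1%} (divisor ~{1 / preview_mbu:.1f})")

# Example: the ~10,200 tokens/s ceiling from the first table (Qwen3-Coder-Next on 8x H200).
print(f"estimate: ~{estimated_speed(10_200, preview_mbu):,.0f} tokens/s")
```

This gives an MBU of about 35.7% and a divisor of about 2.8, and turns the ~10,200 tokens/s ceiling into roughly 3,640 tokens/s, in line with the ~3,650 entry in the second table.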
As GPU HBM bandwidth grows and as the Kog stack — runtime, kernel, collectives, etc. — matures, we expect the speed of large frontier MoE models to move into the 1,000–5,000 tokens/s/request band on standard datacenter GPUs.
Dedicated inference hardware established single-request generation speed as a distinct infrastructure category that will become increasingly important with the rise of autonomous AI agents.
Until now, standard datacenter GPUs have not been able to compete in this category; not because of their hardware, but because of how the software inference stack has been built on top of them.
Our public preview demonstrates that a standard 8-GPU node can generate 3,000 output tokens per second per request on a 2B coding model at batch size 1, with no quantization or speculative decoding. We achieved that by treating the persistent runtime, low-level GPU code, and model architecture as one system.
The broader takeaway is that this is not limited to a small custom model. As available HBM bandwidth grows and the Kog stack matures, we expect the same kind of performance to carry over to the large open-weight MoEs at the frontier of AI agents today.
Kog is a Paris-based AI infrastructure startup building a real-time inference engine for AI agents with innovative low-level GPU Engineering and LLM architecture research. Founded in 2023 by Gaël Delalleau — an École Polytechnique engineer whose career spans cybersecurity research and high-performance GPU work — Kog operates from Paris with a team of 11, including 10 engineers and researchers (5 PhDs).
Kog has raised $5M from Varsity VC and BPI France's Deep Tech Program, and was awarded the French Tech 2030 label in October 2025, a French government recognition granted to select national deep-tech companies contributing to strategic sectors.