We were the first to introduce post-rotation distribution-aware quantization in 2021. It has since been adopted in many fields, including federated learning, vector retrieval, databases, inference engines, and KV-cache compression.
It would be appropriate to receive credit for this. Furthermore, it is baffling to see the name "TurboQuant" repeated in this context, considering the many works published from 2021 onwards.
The blog post mentioned above essentially walks you through EDEN quantization but ultimately settles on a sub-optimal MSE-minimizing version plus an unbiasing trick. That trick often costs a full bit more than DRIVE/EDEN needs to achieve the same results when using the unbiasing scale given in the original 2021 paper.
Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.
`vllm.model_executor.layers.quantization.turboquant`
> The technique implemented here consists of the scalar case of the HIGGS quantization method (Malinovskii et al., "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", NAACL 2025; preprint arXiv:2411.17525): rotation + optimized grid + optional re-normalization, applied to KV cache compression. A first application of this approach to KV-cache compression is in "Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models" (Shutova et al., ICML 2025; preprint arXiv:2501.19392). Both these references pre-date the TurboQuant paper (Zandieh et al., ICLR 2026).
(*hopefully I didn't misunderstand the situation)
"This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant is a special case of EDEN obtained by fixing EDEN's scalar scale parameter to . EDEN supports both biased and unbiased quantization, each optimized by a different (chosen via methods described in the EDEN works). The fixed choice used by TurboQuant is generally suboptimal, although the optimal for biased EDEN converges to as the dimension grows; accordingly TurboQuant approaches EDEN's behavior for large . Second, TurboQuant combines a biased -bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its -bit step uses the suboptimal ; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased -bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with -bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized ) is more accurate than TurboQuant, and unbiased EDEN is markedly more accurate than TurboQuant, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried."
Unfortunately, V4 is not trained for most real-world usage; it is mainly for general world knowledge.
The future is bright for local AI.
I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn’t let you store more model. In some sense that puts local LLM usage at a further disadvantage to inference done in a hyperscaler’s data center.
For example, TurboQuant makes use of QJL (quantized Johnson–Lindenstrauss transforms). One of the first papers to characterize QJL, and in fact the rate-distortion tradeoff for quantized matrix multiplication in general, is "Optimal Quantization for Matrix Multiplication" (https://arxiv.org/abs/2410.13780) by Ordentlich and Polyanskiy.
There is also a more accessible survey on quantized matrix multiplication, "High-Rate Quantized Matrix Multiplication: Theory and Practice" (https://arxiv.org/abs/2601.17187), by the same authors.
TurboQuant cites none of them.
But at the end of the day it is just vanilla HTML, CSS and JS without anything fancy :D MathJax 3 was used to render math stuff.
While the aesthetic doesn't spark joy for me, the overall execution is great, the presentation flow and interactive boxes are very nice.
(In any case, I want to emphasize that the TurboQuant quantizer is a special case of EDEN.)
So shrinking that by 6x (from fp16) would be a big win for larger models. True, TurboQuant can also be applied to model weights; it won't save size over q4 compression, but it will have better accuracy.
Edits: Better context
The attribution is thin, the “6x compression” headline is not clearly separated from prior KV-cache quantization baselines like KIVI, and the RaBitQ comparison is hard to take seriously: single-core CPU for the baseline, A100 GPU for TurboQuant. It is comparing apples-to-datacenter. Worse, there are also public OpenReview comments saying that even the reported accuracy results are not reproducible.
Hard to believe this is the standard for something being promoted as a breakthrough. If this came from a random startup blog, people would be much harsher about it.
The quantizer in TurboQuant is EDEN quantization (2021) applied to the KV-cache. It is neither a novel quantizer nor an improvement in quantization techniques.
In DRIVE/EDEN, we already introduced the version used in the "TurboQuant" paper and suggested optimal scale configurations that are better in both the MSE-minimizing and the unbiased scenarios.
I would strongly recommend exploring that option: renting an RTX 5090 for an evening of image generation for a dollar or two is way more fun than trying to jam big models onto little cards. Just take some time to create a reasonable, scripted deployment workflow for when you create a fresh instance.
Both EDEN and its 1-bit variant have been implemented in PyTorch, JAX, and TensorFlow across numerous open-source libraries and are used in various applications. I am currently writing a blog post that will document these in detail.
EDEN defines a scale parameter, S, for which we suggest specific optimal values for both biased and unbiased versions. As shown in the note I shared, these values lead to clear empirical improvements. Consequently, users who rely on the less optimal S value and the unbiasing method popularized by TurboQuant will generally see inferior results compared to those using EDEN with the optimal scale values suggested in our original papers.
TurboQuant: A First-Principles Walkthrough
Modern language models store large tables of high-dimensional vectors: KV caches, embeddings, attention keys. TurboQuant compresses each coordinate of these vectors to 2–4 bits with provably near-optimal distortion, no memory overhead for scale factors, and no training or calibration. This page explains how it works.
- [DRIVE: One-bit Distributed Mean Estimation](https://arxiv.org/abs/2105.08339) · Vargaftik, Ben-Basat, Portnoy, Mendelson, Ben-Itzhak, Mitzenmacher · NeurIPS 2021 · arXiv:2105.08339
- [EDEN: Communication-Efficient and Robust Distributed Mean Estimation for Federated Learning](https://arxiv.org/abs/2108.08842) · Vargaftik, Ben-Basat, Portnoy, Mendelson, Ben-Itzhak, Mitzenmacher · ICML 2022 · arXiv:2108.08842
- [QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead](https://arxiv.org/abs/2406.03482) · Zandieh, Daliri, Han · 2024 · arXiv:2406.03482
- [PolarQuant: Quantizing KV Caches with Polar Transformation](https://arxiv.org/abs/2502.02617) · Han, Kacham, Karbasi, Mirrokni, Zandieh · 2025 · arXiv:2502.02617
- [TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate](https://arxiv.org/abs/2504.19874) · Zandieh, Daliri, Hadian, Mirrokni · 2025 · arXiv:2504.19874
- [A Note on TurboQuant and the Earlier DRIVE/EDEN Line of Work](https://arxiv.org/abs/2604.18555) · Portnoy et al. · 2026 · arXiv:2604.18555
The single load-bearing idea: in high dimensions, a random rotation turns every input vector into one whose coordinates follow a known fixed distribution. A codebook designed once for that distribution can then be reused for every input. Everything else on this page is the construction that puts this observation to work.
This construction was introduced by DRIVE (Vargaftik et al., NeurIPS 2021) for one-bit federated mean estimation, and generalized to b bits per coordinate by EDEN (Vargaftik et al., ICML 2022). TurboQuant (Zandieh et al., 2025) is the same construction with the per-vector scale parameter fixed to a constant, repackaged for KV-cache compression and inner-product retrieval. The mapping from each idea on this page to its source paper is in §0.9.
§0 · Primer: jargon decoder
Each mini-demo below covers one concept used later. Skip the ones you already know.
§0.1 · Vector
A list of numbers. An arrow in space.
A vector is an ordered list: [0.3, −1.2]. Geometrically it is an arrow from the origin. A d-dimensional vector is an arrow in $d$-space, hard to picture past 3-D, but the rules are the same.
[Interactive demo — drag the tip; readouts: coords, length]
§0.2 · Length ‖x‖ & Inner Product ⟨x,y⟩
How much one vector points along another.
Length = $\sqrt{x_1^2+x_2^2+\dots}$. Inner product $\langle x,y\rangle = x_1 y_1 + x_2 y_2 + \dots = \|x\|\|y\|\cos\theta$. The inner product reaches its largest positive value when the two arrows point in the same direction. It drops to zero when the two arrows are perpendicular. It becomes negative when the arrows point in opposite directions, with its most negative value when they point exactly opposite.
[Interactive demo — drag either tip; readouts: ‖x‖, ‖y‖, ⟨x,y⟩, angle]
§0.3 · Mean Squared Error
Why we square the mistake.
Error is the distance between a guess and the truth. Scoring a guess by the signed error lets positive and negative errors cancel, which means the score does not penalise being off. Squaring forces every error to count as a positive number and gives big errors a larger penalty than small ones. The guess that minimises the mean of squared errors is the data’s average: it is the unique number that minimises the sum of squared distances to the points.
The first moment of a quantity $X$ is its mean $\mathbb{E}[X]$; the second moment is the mean of its square $\mathbb{E}[X^2]$. A zero-mean variable has a vanishing first moment because positive and negative deviations cancel. Its second moment is strictly positive whenever any deviation is nonzero, because squared values are nonnegative and cannot cancel. The MSE above is itself a second moment of the residual error. This distinction returns in §7, where the per-input gap $\tilde y - y$ averages to zero in the first moment, while its square averages to a strictly positive quantity in the second.
The average has a property we will use in §7. It lies between the data’s most extreme points, so its magnitude is smaller than at least one of them. When a quantizer compresses a whole bin of values down to the bin’s average, the stored value is smaller in magnitude than the bin’s largest values. The reconstruction is a shrunken version of the input. An inner product against a shrunken reconstruction comes out smaller than the same inner product against the input.
[Interactive demo — slide the guess; readouts: mean of data, MSE at guess, MSE at mean]
§0.4 · Unbiased vs Biased Estimator
Noisy is fine. Systematically off is not.
An estimator is a procedure that takes data and returns a guess $\hat\theta$ for an unknown truth $\theta$. Repeat it on fresh data and the guesses form a cloud. The cloud can fail in two independent ways. Variance is one: individual guesses are noisy. Bias is the other: the procedure is wrong even after averaging many guesses. An estimator with $\mathbb{E}[\hat\theta]=\theta$ is unbiased; the cloud’s centre sits at $\theta$ regardless of the cloud’s width.
The bullseye below shows both failure modes. Bias is the distance from the cloud’s centre to the crosshair. Variance is the width of the cloud. The two quantities are independent of each other. §7 runs the same bullseye against the MSE quantizer of §6, and the cloud’s centre lands away from the crosshair. §8 runs it against a different estimator whose cloud centres on the crosshair.
[Interactive demo — bullseye; control: mode; readouts: shots, mean of shots, bias]
§0.5 · Rotation
A rigid spin. Preserves lengths and angles.
A rotation matrix $R$ spins space. The key property: $\|Rx\|=\|x\|$ and $\langle Rx,Ry\rangle=\langle x,y\rangle$. Rotation only changes the basis the coordinates are written in, not the geometry.
[Interactive demo — drag the tip, set angle θ; readouts: ‖x‖ before → after, preserved?]
§0.6 · Where bell-curves come from (CLT)
Add up many small randoms → Gaussian.
The Central Limit Theorem says that summing enough independent random numbers produces a distribution close to a bell curve. The shape of each individual term in the sum does not affect the limit. A sum of coin flips converges to the same Gaussian shape as a sum of uniform draws or a sum of skewed draws. A rotated coordinate is one of these sums: it is a weighted combination of every coordinate of the original vector, with random weights. After a random rotation, each new coordinate is therefore approximately Gaussian, which is the property TurboQuant relies on for every input.
[Interactive demo — controls: terms in sum $n$, source shape; readout: converged?]
§0.7 · Life in many dimensions
Coordinates of a random unit vector are all small.
Pick a random point on a unit sphere in $d$ dimensions. In 2-D any coordinate is possible. In 100-D, almost every coordinate is close to $\pm 1/\sqrt{d}$. This is measure concentration, and it is the core fact TurboQuant exploits.
[Interactive demo — control: dimension $d$; readouts: std of $x_1$, $1/\sqrt{d}$]
§0.8 · Quantization, in one dimension
Snap every number to the nearest of $2^b$ levels.
This is what $b$ bits per number means. With $b=2$ you get 4 levels, $b=3$ gives 8. The gap between levels is your worst-case error. Adding one bit halves the gap, so the squared error drops by 4× per bit, the $4^{-b}$ factor that shows up later.
[Interactive demo — control: bits $b$; readouts: levels, gap Δ, max error]
■ CHEAT SHEET · Eight ideas, one sentence each
Vector: ordered list of numbers / arrow from the origin. Length & inner product: the norm $\sqrt{\sum x_i^2}$ and how much two vectors point the same way. MSE: average squared error. Unbiased: the average of many estimates equals the truth. Rotation: change of basis that preserves lengths and angles. CLT: sum of many independent randoms converges to a Gaussian. High-D concentration: coordinates of a random unit vector in $d$-space cluster near $\pm 1/\sqrt d$. Quantization: snap each number to one of $2^b$ levels; one extra bit quarters the squared error.
§0.9 · Lineage and prior work
DRIVE (Vargaftik et al., NeurIPS 2021) introduced the construction for one bit per coordinate. A sender rotates the input vector by a random orthogonal matrix, sends the sign of every rotated coordinate together with a single scalar scale $S$, and the receiver inverts the rotation after multiplying the sign vector by $S$. DRIVE derives two scale formulas. The MSE-optimal biased scale is $S = \|R(x)\|_1 / d$. The unbiased scale is $S = \|x\|_2^2 / \|R(x)\|_1$, which gives $\mathbb{E}[\hat x] = x$. DRIVE also shows that a Randomized Hadamard Transform can replace the uniform random rotation at $O(d \log d)$ cost (DRIVE, §6).
EDEN (Vargaftik et al., ICML 2022) generalizes DRIVE to any $b$ bits per coordinate. After the rotation, EDEN normalizes the rotated vector by $\eta_x = \sqrt{d}/\|x\|_2$ so each coordinate is approximately $\mathcal{N}(0,1)$, then quantizes against a Lloyd-Max codebook designed once for the standard normal. The 1-bit codebook is $\{\pm\sqrt{2/\pi}\}\approx\{\pm 0.798\}$ and the 2-bit codebook is $\{\pm 0.453, \pm 1.510\}$. These are the exact codebooks the page derives in §5. EDEN keeps a per-vector scale $S = \|x\|_2^2 / \langle R(x),\, Q(\eta_x R(x))\rangle$ that yields an unbiased estimate (EDEN, Theorem 2.1).
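For concreteness, here is a minimal NumPy sketch of both constructions as just described. It assumes an explicit $d\times d$ rotation matrix `R` (the real implementations use a Randomized Hadamard Transform instead of a dense matrix) and a precomputed Lloyd–Max codebook for EDEN; the function names are illustrative, not from any released library.

```python
import numpy as np

def drive_1bit(x, R, unbiased=True):
    """DRIVE: rotate, keep signs, decode with a single scalar scale S."""
    d = x.size
    z = R @ x                                   # rotated input
    signs = np.sign(z)                          # 1 bit per coordinate
    if unbiased:
        S = np.dot(x, x) / np.abs(z).sum()      # unbiased scale  ||x||^2 / ||R(x)||_1
    else:
        S = np.abs(z).sum() / d                 # MSE-optimal biased scale  ||R(x)||_1 / d
    return R.T @ (S * signs)                    # receiver: scale, rotate back

def eden_bbit(x, R, codebook):
    """EDEN: normalize to ~N(0,1) coords, snap to the codebook, unbias with S."""
    d = x.size
    z = R @ x
    eta = np.sqrt(d) / np.linalg.norm(x)        # eta_x = sqrt(d) / ||x||_2
    q = codebook[np.argmin(np.abs(eta * z[:, None] - codebook), axis=1)]
    S = np.dot(x, x) / np.dot(z, q)             # per-vector unbiased scale (EDEN, Thm 2.1)
    return R.T @ (S * q)

# e.g. codebook = np.array([-0.7979, 0.7979]) for b = 1,
#      np.array([-1.510, -0.453, 0.453, 1.510]) for b = 2
```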
■ IDEA-TO-SOURCE MAP
| Idea on this page | First introduced in |
|---|---|
| Random rotation, post-rotation distribution analysis (§3, §4) | DRIVE (2021), §3 |
| Randomized Hadamard transform as the practical rotation (§3) | DRIVE (2021), §6; EDEN (2022), §5 |
| Lloyd-Max codebook on $\mathcal{N}(0,1)$, including the $\{\pm\sqrt{2/\pi}\}$ 1-bit and $\{\pm 0.453,\,\pm 1.510\}$ 2-bit codebooks (§5) | EDEN (2022), §3 |
| Unbiased rotation-then-quantize via a per-vector scale (§7, §8 backbone) | DRIVE (2021), §4.2, Theorem 3 |
| The $b$-bit pipeline (§6) | EDEN (2022) with the per-vector scale fixed to a constant |
| QJL residual unbiasing (§8) | One-bit unbiased DRIVE applied to the residual instead of the input |
| Residual chain: biased $(b{-}1)$-bit + unbiased 1-bit (§7 then §8) | TurboQuant (2025) |
| KV-cache and inner-product application framing | TurboQuant (2025); QJL (2024) for the 1-bit case |
§1 · Vector quantization
You have a vector $\mathbf{x}\in\mathbb{R}^d$, say $d{=}1536$ floats from an OpenAI embedding. You want to store it using $b$ bits per coordinate (total $b\cdot d$ bits), then later recover an approximation $\tilde{\mathbf{x}}$ close to $\mathbf{x}$. Closeness is measured by
MSE distortion $D_{\text{mse}} = \mathbb{E}\big[\,\|\mathbf{x} - \tilde{\mathbf{x}}\|_2^2\,\big]$ or inner-product error $D_{\text{prod}} = \mathbb{E}\big[\,|\langle\mathbf{y},\mathbf{x}\rangle - \langle\mathbf{y},\tilde{\mathbf{x}}\rangle|^2\,\big]$
The second one matters because attention scores and nearest-neighbor queries are all inner products. We would like the estimator to be unbiased: $\mathbb{E}[\langle\mathbf{y},\tilde{\mathbf{x}}\rangle] = \langle\mathbf{y},\mathbf{x}\rangle$.
■ KEY WORDS
MSE distortion: average squared error between the true vector and its reconstruction, primer §0.3.
Inner product $\langle y, x\rangle$: how much two vectors point the same way, primer §0.2. This is what attention computes.
Estimator: a rule (here: quantize, then decode) that returns an approximation $\hat s$ of a true number $s$.
Unbiased estimator: across many queries, the average of $\hat s$ equals $s$. Individual estimates can be noisy; the mean is on target. Primer §0.4.
For each coordinate, pick the closest of $2^b$ evenly-spaced levels in $[-1, 1]$. That is $b$ bits per number. The same rule runs in 2D and 3D first, where the geometry is visible, before the high-dimensional version below.
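In code, the naive rule is one nearest-level lookup per coordinate (a sketch; the $[-1,1]$ range and the names are assumptions):

```python
import numpy as np

def naive_quantize(x, b, L=1.0):
    """Snap each coordinate to the nearest of 2^b evenly spaced levels in [-L, L]."""
    levels = np.linspace(-L, L, 2 ** b)
    idx = np.argmin(np.abs(x[:, None] - levels), axis=1)   # b bits per coordinate
    return levels[idx]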
Drag the tip of the vector. The vector snaps to the nearest point of a $2^b \times 2^b$ grid. The green arrow shows the original input. The blue arrow shows where the input is quantized to. The red segment between them is the reconstruction error $\mathbf{x} - \tilde{\mathbf{x}}$.
[Interactive 2-D demo — controls: bits $b$, preset; drag the tip; readouts: ‖error‖/‖x‖, levels per axis, grid points]
A $2^b$-level grid on three axes gives $2^{3b}$ snap points. Drag the canvas to orbit the view. The spike preset shows where the construction breaks: the input lies near one axis and falls between two grid levels, which is where the reconstruction error is largest.
[Interactive 3-D demo — controls: bits $b$, preset; drag the tip or orbit; readouts: ‖error‖/‖x‖, levels per axis, grid points]
The same rule applied to every coordinate of a high-dimensional vector. You cannot see the grid anymore, but the per-coordinate errors are still there.
[Interactive high-D demo — controls: bits $b$, dimension $d$, input; plot: original $x_i$ vs quantized $\tilde{x}_i$; readouts: ‖x − x̃‖²/‖x‖², levels per coord, bits used]
Select the spike input. The naive quantizer's grid is spaced evenly over $[-1, 1]$. The input has almost all of its magnitude in a single coordinate, whose value falls between the two grid levels nearest to it and so reconstructs poorly. The remaining coordinates are near zero and consume most of the levels despite carrying little of the input's information.
■ TAKEAWAY · NEXT §2 · where the gap shows up
A fixed grid produces small reconstruction errors on inputs whose coordinates are roughly uniform in magnitude, and large reconstruction errors on inputs whose magnitude is concentrated in one or a few coordinates. Next: §2 shows how production systems handle the second case and what they pay for the fix.
§2 · Why naive fails
Real embeddings are rarely flat. Trained models produce outlier channels, a few coordinates much larger than the rest. A fixed $[-L, L]$ grid either clips the outliers or wastes resolution on the bulk. Production quantizers (GPTQ, AWQ, KIVI, KVQuant) work around this by computing $(\min, \max)$ (or zero-point and scale) for every small block and storing those in full precision as side information.
The catch. To decode any block you also need its scale and zero-point, two float16 numbers (32 extra bits) stored next to every 16–64 quantized values. Walk through one case: a block of 32 numbers at 3 bits each is 96 payload bits, plus 32 metadata bits, which works out to 4 bits per number, not 3. Smaller blocks of 16 numbers push it to 5 bits per number. The advertised 3-bit scheme is really a 4–5-bit scheme once you count everything. TurboQuant matches this worst-case quality while storing zero per-block metadata.
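The arithmetic is worth pinning down; a two-line helper (hypothetical name, assuming one float16 scale plus one float16 zero-point per block):

```python
def effective_bits(b, block_size, header_bits=32):
    """Effective bits/value once per-block metadata is counted."""
    return b + header_bits / block_size

print(effective_bits(3, 32))   # 4.0 bits/value
print(effective_bits(3, 16))   # 5.0 bits/value
```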
■ DEMO · feel the catch · same b bits/value, three strategies
A 64-dimensional vector whose coordinates are mostly small, with one large outlier shown in red. Three quantizers reconstruct the same vector at the same b-bit budget. Strategy A uses a single fixed grid for the whole vector. Strategy B adapts the grid per block, at the cost of a float16 header per block. Strategy C rotates the vector first and then applies a single fixed grid. The metrics report the RMSE of each reconstruction and the effective bits-per-value once the metadata cost is included.
[Interactive demo — controls: outlier magnitude, bit budget $b$, block size $s$; plot: source vector (outlier in red, dashed lines = fixed grid range); readouts per strategy: RMSE, bits/value, overhead]

A. Fixed grid $[-L, L]$: one global range, $b$ bits/value, no header. The outlier clips.

B. Per-block scale + zero: a float16 scale and zero-point per block (dashed dividers). The outlier fits, but the header taxes you.

C. Rotate → fixed grid: the rotation smears the spike across all 64 coordinates. One global grid works, no header.
Read the storage line. The effective bits-per-value works out to b + 32/s for the per-block scheme and to b for the other two, because only the per-block scheme stores a float16 scale and zero-point (32 bits together) for every block of s elements. At b=3, s=16 the per-block cost works out to 3 + 2 = 5 bits/value, a 66% surcharge over the nominal b. Strategy C achieves the same storage cost as strategy A while producing the reconstruction quality of strategy B. The rest of this page explains the construction that makes that possible.
■ TAKEAWAY · NEXT §3 · one fixed recipe, any input
Production quantizers handle outliers by paying a per-block metadata tax. TurboQuant must instead be data-oblivious: a single procedure that runs on every vector with no calibration set and no per-block headers. Next: §3 introduces the move that makes a fixed grid work for every input.
§3 · The rotation trick
The rotation trick: apply a random orthogonal transform $\boldsymbol{\Pi}$, then quantize coordinate-wise. Rotation is lossless, it preserves length and inner products exactly:
$\|\boldsymbol{\Pi}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$ · $\langle \boldsymbol{\Pi}\mathbf{x},\,\boldsymbol{\Pi}\mathbf{y}\rangle = \langle\mathbf{x},\mathbf{y}\rangle$ · $\boldsymbol{\Pi}^{\!\top}\boldsymbol{\Pi} = \mathbf{I}$
Because rotation is exact, all reconstruction error comes from the quantization step alone. After a uniformly random rotation, every coordinate of $\boldsymbol{\Pi}\mathbf{x}$ follows the same fixed Beta density (Lemma 1 of the paper), regardless of what $\mathbf{x}$ looked like. A single codebook designed once for that density is then optimal for every input. We build the codebook in §5.
Lineage The random-rotation step and the analysis of the post-rotation Beta density were introduced by DRIVE (Vargaftik et al., NeurIPS 2021, §3). DRIVE also shows the density approaches $\mathcal{N}(0,1)$ as $d$ grows, which is what makes a single fixed codebook work. See §0.9 for the full mapping.
How to construct $\boldsymbol{\Pi}$
Generate a $d\times d$ matrix of i.i.d. $\mathcal{N}(0,1)$ entries and run QR decomposition; keep the orthogonal factor $Q$. The result is uniform on the orthogonal group $O(d)$, which is what Lemma 1 needs.
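A sketch of that construction in NumPy (the sign fix on the QR factor is the standard detail that makes $Q$ exactly Haar-uniform; in practice the Randomized Hadamard Transform from §0.9 replaces the dense matrix):

```python
import numpy as np

def random_rotation(d, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, d))             # i.i.d. N(0,1) entries
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))              # flip column signs so Q ~ uniform on O(d)
```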
Start with the extreme case: a vector with all of its magnitude in one coordinate, $(1, 0)$. Rotate by angle $\theta$ and observe how the magnitude is redistributed across the two coordinates. At $\theta{=}45°$ the magnitude is split evenly between the two coordinates, giving $(\tfrac{1}{\sqrt 2}, \tfrac{1}{\sqrt 2})$. The total length of the vector stays the same throughout.
[Interactive 2-D demo — control: angle θ; drag the tip; readouts: max |coord|, length ‖x‖, θ at even split = 45°]
The same construction in three dimensions. The spike $(1, 0, 0)$ is rotated by a random orthogonal matrix, which spreads the input's magnitude across all three coordinates of the output. The total length of the vector is preserved. Each fresh draw of the random rotation produces a different spread.
[Interactive 3-D demo — drag the tip or orbit; readouts: max |coord|, length ‖x‖, typical max at a random rotation]
A 45° rotation in 2-D reduces the largest coordinate to $1/\sqrt{2}\approx 0.71$ of the input's magnitude. A random rotation in 3-D typically leaves the largest coordinate around $0.8$. At $d{=}64$ a typical coordinate after rotation has magnitude around $1/\sqrt d \approx 0.125$, and even the largest coordinate is only a small multiple of that, regardless of how concentrated the input was.
[Interactive high-D demo — controls: dimension $d$, input; plots: $|x_i|$ before and $|(\boldsymbol{\Pi}\mathbf{x})_i|$ after; readouts: max $|x_i|/\|x\|$, max $|(\boldsymbol{\Pi}\mathbf{x})_i|/\|x\|$, ‖x‖ (preserved)]
■ TAKEAWAY · NEXT §4 · no spike survives a random rotation
Rotation preserves length and inner products. The only thing it changes is which coordinates contain the magnitude of the vector. A vector with all of its mass concentrated in one coordinate becomes, after rotation, a vector whose mass is spread across all $d$ coordinates. Because every input is rotated before quantization, every input that gets quantized is of this spread-out kind. Next: §4 explains why rotation flattens spikes using the geometry of high-dimensional spheres.
§4 · Why rotation works
Rotating $\mathbf{x}$ by a uniformly random $\boldsymbol{\Pi}$ is the same as picking a random point on the sphere of radius $\|\mathbf{x}\|$. So the question “what does a coordinate of $\boldsymbol{\Pi}\mathbf{x}$ look like?” is the same question as “what does a coordinate of a uniform point on the sphere look like?”
In low dimensions the answer is far from a bell curve. In 2-D the marginal is the arcsine density, which is U-shaped with peaks at $\pm 1$. In 3-D it is uniform on $[-1, 1]$. As $d$ grows the marginal narrows and converges to a Gaussian with variance $1/d$. The convergence is visible in the demos that follow.
The exact density (Lemma 1)
For a uniform point on $\mathbb{S}^{d-1}$, the marginal density of any single coordinate is
$f_X(x) \;=\; \dfrac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)}\,(1-x^2)^{(d-3)/2},\quad x\in[-1,1]$
a scaled/shifted Beta distribution. It converges pointwise to $\mathcal{N}(0,\,1/d)$ as $d\to\infty$.
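A quick Monte Carlo check of that statement (a sketch; a uniform point on the sphere is obtained by normalizing a Gaussian vector):

```python
import numpy as np

d, n = 64, 200_000
g = np.random.default_rng(1).standard_normal((n, d))
sphere = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on S^{d-1}
x1 = sphere[:, 0]                                       # one coordinate's marginal
print(x1.std(), 1 / np.sqrt(d))                         # both ≈ 0.125 at d = 64
```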
Sample 2000 points uniformly from the unit circle and look at a single coordinate, say $x_1$. The marginal is the arcsine density $\tfrac{1}{\pi\sqrt{1-x^2}}$, which is U-shaped with peaks at $\pm 1$. The shape is far from Gaussian: any value of $x_1$ between $-1$ and $+1$ is possible, and the endpoints are more likely than the middle.
[Interactive demo — points on the unit circle; plot: marginal of $x_1$ (arcsine); readouts: std of $x_1$, $1/\sqrt{d} = 0.707$]
Now sample uniformly from the unit sphere in 3-D. The marginal of one coordinate is uniform on $[-1, 1]$ (Archimedes' hat-box theorem). The marginal is still not a bell curve. Drag to orbit the view.
[Interactive demo — points on the unit sphere, drag to orbit; plot: marginal of $x_1$ (uniform on $[-1,1]$); readouts: std of $x_1$, $1/\sqrt{d} = 0.577$]
Drag $d$ upward. The marginal narrows and converges to a Gaussian with standard deviation $1/\sqrt d$. By $d{=}30$ the marginal is visually Gaussian. By $d{=}256$ almost all of the mass concentrates within a thin shell of width $\sim 1/\sqrt d$ around zero.
[Interactive demo — controls: dimension $d$, samples; plot: empirical histogram, exact Beta PDF, $\mathcal{N}(0, 1/d)$ approximation]
Distinct coordinates are also approximately independent, a stronger condition than uncorrelated, and what is actually needed for the per-coordinate quantization argument below.
■ TAKEAWAY · NEXT §5 · one distribution, one codebook
Every coordinate of a rotated vector follows the same known density. The scalar quantization problem for that density can be solved once, and the solution can be reused for every coordinate of every vector. There are no per-block scale factors and no side information to store. Next: §5 builds the codebook with Lloyd–Max.
§5 · The universal codebook
Every rotated coordinate looks like a draw from the same density (§4). So there is one scalar problem to solve, once: pick $2^b$ landing values on the number line such that snapping any sample to its nearest landing value introduces as little error as possible. Those landing values are the codebook.
A classical algorithm finds them: Lloyd–Max (Lloyd 1957/82, Max 1960). Because the density is fixed and known in advance, Lloyd–Max runs once at table-build time. The resulting landing values are saved into a tiny per-$b$ table. Encoding a coordinate after that is a single nearest-neighbour lookup against the table. The same table is used for every input, with no calibration step and no per-vector tuning.
Drag $b$ below to watch Lloyd–Max settle on the landing values for the Beta density.
The Lloyd–Max iteration
Given a PDF $f_X$, choose centroids $c_1 \le \dots \le c_{2^b}$ minimising $\int (x - c_{i(x)})^2 f_X(x)\,dx$ by alternating two steps: set each bin boundary to the midpoint between adjacent centroids, then set each centroid to the conditional mean of $f_X$ over its bin.
Repeat until stable. The demo runs this on the Beta density of §4.
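Before the demo, a minimal numerical version of the iteration (a sketch: it runs on the standard normal, which is the large-$d$ limit of the Beta marginal after rescaling by $\sqrt d$; the grid and iteration counts are arbitrary choices):

```python
import numpy as np

def lloyd_max_gaussian(b, iters=300):
    """Lloyd-Max centroids for N(0,1), computed on a dense grid."""
    x = np.linspace(-8, 8, 200_001)
    pdf = np.exp(-x ** 2 / 2)                   # normalization cancels in the ratio below
    c = np.linspace(-2, 2, 2 ** b)              # initial centroids
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2            # bin boundaries = centroid midpoints
        idx = np.searchsorted(edges, x)         # which bin each grid point falls in
        mass = np.bincount(idx, weights=pdf, minlength=2 ** b)
        mean = np.bincount(idx, weights=pdf * x, minlength=2 ** b)
        c = mean / mass                         # centroid = conditional mean of its bin
    return c

print(lloyd_max_gaussian(1))   # ≈ [-0.798, 0.798]   (±sqrt(2/pi))
print(lloyd_max_gaussian(2))   # ≈ [-1.510, -0.453, 0.453, 1.510]
```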
[Interactive demo — controls: bits $b$, dimension $d$; plot: Gaussian $\mathcal{N}(0,1/d)$, centroids $c_k$, bin boundaries; readouts: iteration, MSE per coord, Shannon bound $4^{-b}/d$]
For moderate $d$, the paper's explicit centroids (after normalising by $\sqrt{d}$) are: $b{=}1\!:\pm\sqrt{2/\pi}$, $b{=}2\!:\{\pm 0.453,\pm 1.510\}$, and so on. Theorem 1 proves the per-coordinate MSE is $\lesssim \tfrac{\sqrt{3}\pi}{2d}\cdot 4^{-b}$. The constant $\tfrac{\sqrt{3}\pi}{2}\approx 2.72$ is the asymptotic ratio to Shannon's minimum $\tfrac{1}{d}\cdot 4^{-b}$; at $b{=}1$ the paper reports a tighter ratio of $\approx 1.45$.
Lineage The Lloyd–Max codebook for the post-rotation marginal is the codebook EDEN derives (Vargaftik et al., ICML 2022, §3). The 1-bit and 2-bit landing values shown above ($\pm\sqrt{2/\pi}$ and $\{\pm 0.453,\,\pm 1.510\}$) match the EDEN tables. See §0.9 for the full mapping.
■ TAKEAWAY · NEXT §6 · a tiny lookup, baked once
Lloyd–Max gives the optimal partition for a known density, so the centroids for the Beta marginal can be precomputed and stored as a tiny per-$b$ table. The per-coordinate MSE that the resulting codebook achieves is within a factor of $\approx 2.72$ of Shannon's lower bound asymptotically and within $\approx 1.45$ at $b{=}1$. Next: §6 assembles rotation and codebook into TurboQuant-MSE.
§6 · TurboQuant-MSE
STEP 1
Rotate
$\mathbf{y} = \boldsymbol{\Pi}\mathbf{x}$. Same $\boldsymbol{\Pi}$ reused for every vector.
STEP 2
Round each coord
For each $j$, $\texttt{idx}_j = \arg\min_k |y_j - c_k|$. Stores $b$ bits.
STEP 3
Store
Total: $b\!\cdot\!d$ bits. No scales, no zero-points.
STEP 4
Look up
$\tilde{y}_j = c_{\texttt{idx}_j}$ from the universal codebook.
STEP 5
Rotate back
$\tilde{\mathbf{x}} = \boldsymbol{\Pi}^{\!\top}\tilde{\mathbf{y}}$. Done.
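The five steps in a minimal NumPy sketch (assuming a unit-norm input, `Pi` built as in §3, and a codebook like §5's rescaled by $1/\sqrt d$ to match the $\mathcal{N}(0,1/d)$ marginal; names are illustrative):

```python
import numpy as np

def tq_mse_encode(x, Pi, codebook):
    y = Pi @ x                                                # step 1: rotate
    return np.argmin(np.abs(y[:, None] - codebook), axis=1)  # steps 2-3: round, store b*d bits

def tq_mse_decode(idx, Pi, codebook):
    return Pi.T @ codebook[idx]                               # steps 4-5: look up, rotate back

# usage sketch: Pi = random_rotation(64); cb = lloyd_max_gaussian(2) / np.sqrt(64)
#               x_tilde = tq_mse_decode(tq_mse_encode(x, Pi, cb), Pi, cb)
```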
[Interactive demo — controls: bits $b$, dimension $d$, input; panels: x (original), Πx (rotated, nearly Gaussian), quantized Πx (snap to codebook), x̃ = Πᵀ·quant(Πx) (recovered), error x − x̃; readouts: ‖x − x̃‖²/‖x‖², naïve (no rotation), Shannon floor $4^{-b}$, compression factor]
Toggle between input types. Naive quantization without rotation fails on the spike input and on the outlier-channel input. With the rotation step in front, the reconstruction error is roughly the same regardless of which input is selected. Every rotated coordinate follows the same $\mathcal{N}(0,\,1/d)$ distribution, which is the distribution the codebook was designed for.
■ TAKEAWAY · NEXT §7 · MSE is solved, but…
TurboQuant-MSE stores $b\cdot d$ bits per vector and zero metadata. The reconstructed $\tilde{\mathbf{x}}$ is nearly as close to the original $\mathbf{x}$ as any quantizer can achieve, within a factor of $\approx 2.72$ of Shannon's information-theoretic lower bound. Next: §7 shows that the same codebook produces a systematically biased estimate of inner products. This is an error that minimising reconstruction MSE does not address.
§7 · The inner-product bias
§6’s TurboQuant-MSE keeps $\tilde{\mathbf{x}}$ close to $\mathbf{x}$ in squared distance. Attention does not measure $\|\mathbf{x}-\tilde{\mathbf{x}}\|^2$. It computes $\langle \mathbf{q}, \tilde{\mathbf{k}}\rangle$ and uses that number as a stand-in for $\langle \mathbf{q}, \mathbf{k}\rangle$. The MSE codebook gives a systematically wrong answer to the inner-product question. Each trial returns the same error, so averaging many trials does not remove it.
Two earlier facts produce the shrinkage. In §0.3 the MSE-optimal reconstruction for a set of values was the set’s average, and that average had smaller magnitude than the set’s extreme values. In §4 a random rotation made every coordinate of $\boldsymbol{\Pi}\mathbf{x}$ behave like a zero-mean draw with most of its mass close to 0. Combine the two and the shrinkage is forced: the encoder partitions each axis into $2^b$ bins and stores only which bin $\boldsymbol{\Pi}\mathbf{x}$ fell into, the decoder reconstructs with the bin’s average, and the bin’s average sits closer to 0 than the tail inputs that fall into the same bin. The reconstruction $\tilde{\mathbf{x}}$ is therefore a shrunken copy of $\mathbf{x}$, and an inner product $\langle \mathbf{q}, \tilde{\mathbf{k}}\rangle$ comes out smaller than $\langle \mathbf{q}, \mathbf{k}\rangle$. Because the codebook is fixed, the shrinkage factor is identical on every trial.
■ SEE THE SHRINKAGE · drag y, watch ỹ snap
One rotated coordinate $y$ has the near-Gaussian density drawn on top. Lloyd–Max partitions the axis into $2^b$ bins (interior verticals); each bin’s centroid is the MSE-optimal reconstruction (red dots). Drag the mint handle to set $y$. The encoder snaps it to the centroid of the bin it fell into, giving $\tilde y$ (red). The staircase underneath plots that map $\tilde y(y)$ across the whole axis at once: every horizontal step sits inside the dashed identity line, and the gap between step and identity is the shrinkage at that input.
[Interactive demo — control: bits $b$; drag $y$, watch $\tilde y$ snap to its bin's centroid]

Variance budget. $\sigma^2 = 1$ splits into the part $\tilde y$ keeps and the part erased inside each bin: at $b{=}1$, kept $\lambda_b = 0.637$ and erased in-bin variance $D_b = 0.363$. Readouts: $y$, $\tilde y$, the instantaneous $(\tilde y - y)^2$ (which averages to $D_b$), and $\lambda_b = \mathbb{E}[\tilde y^{\,2}]/\sigma^2$ (the population quantity that counts).
What to notice. Shrinkage is a second-moment statement. The signed gap $\tilde y - y$ is positive for some inputs and negative for others; averaging it gives zero, so any first-moment argument cancels out. The squared gap $(\tilde y - y)^2$ in the third metric is always nonnegative, so summing it cannot cancel. Weighted by the Gaussian density and integrated, it equals $D_b = \sigma^2 - \mathbb{E}[\tilde y^{\,2}]$, the red segment of the budget bar; the staircase shading visualizes that gap pointwise, with opacity tracking the Gaussian density so the visual area concentrates where the distortion actually accumulates. As $b$ grows the shading thins and the red segment shrinks with it. The fourth metric, $\lambda_b = \mathbb{E}[\tilde y^{\,2}]/\sigma^2$, is the factor that multiplies every inner product $\langle\mathbf q,\tilde{\mathbf k}\rangle$ in expectation; that is the shrinkage the next paragraph quotes as $0.64 / 0.88 / 0.97 / 0.99$ for $b=1,2,3,4$.
The bullseye below measures the shrinkage. At $b{=}1$ the offset is $1 - 2/\pi \approx 0.36$ on every axis. The shrinkage factor approaches 1 quickly with more bits (about 0.88 at $b{=}2$, 0.97 at $b{=}3$, 0.998 at $b{=}5$), so by $b{=}3$ the residual bias is smaller than the trial-to-trial noise of a few thousand shots and the red dot visually overlaps the crosshair. The bias is theoretically strictly nonzero at every finite $b$, but the regime where it matters in practice is the low-bit one (1–2 bits per coordinate), where it dominates the per-trial variance.
■ HOW TO READ · drag b, watch the red dot
Same bullseye as the primer. Each trial fires two shots at the target, one inner-product estimate against $\mathbf{y}_1$ and one against an independent $\mathbf{y}_2$, both divided by their truth and re-centred so a perfect estimate lands on the centre. The yellow crosshair marks truth, the red dot is the average of every shot fired so far. Unbiased means the red dot sits on the crosshair, no matter how wide the cloud of shots around it.
[Interactive demo — controls: bits $b$, dimension $d$, trials; panel: MSE-optimal codebook, biased (cloud's centre is off the crosshair); readouts: MSE mean ratio, MSE theory = 0.6366, truth = 1.0000]
What to notice. At $b{=}1$ the red dot is southwest of the crosshair, on the diagonal. The offset on $\mathbf{y}_1$ and the offset on $\mathbf{y}_2$ are equal, which is what one scalar shrinkage applied to the whole reconstruction would produce. Increase $b$: the offset shrinks fast and is below the trial-to-trial noise by $b{=}3$, even though the underlying shrinkage factor is still strictly less than 1.
Derivation: where the $2/\pi$ factor comes from
For a standard Gaussian $g$, $\mathbb{E}[|g|]=\sqrt{2/\pi}$, the “half-normal” mean. The 1-bit MSE codebook rounds each rotated coordinate to $\pm\sqrt{2/\pi}/\sqrt d$; when you dot-product that reconstruction back against $\mathbf{y}$, you pick up another $\sqrt{2/\pi}$ factor in expectation. Multiply: $2/\pi \approx 0.637$.
Concretely at $b{=}1$, the optimal MSE codebook is $\{-\sqrt{2/\pi}/\sqrt{d},\,+\sqrt{2/\pi}/\sqrt{d}\}$, so $Q(\mathbf{x}) = \sqrt{2/(\pi d)}\cdot \operatorname{sign}(\boldsymbol{\Pi}\mathbf{x})$ and
$\mathbb{E}\big[\langle\mathbf{y},\tilde{\mathbf{x}}\rangle\big] \;=\; \dfrac{2}{\pi}\cdot\langle\mathbf{y},\mathbf{x}\rangle.$
The factor shrinks as $b$ grows but never vanishes, which is what the demo above shows.
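The factor is easy to confirm numerically; a sketch with a unit-norm $\mathbf{x}$, a correlated query $\mathbf{y}$, and a fresh Haar rotation per trial:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256
x = rng.standard_normal(d); x /= np.linalg.norm(x)          # unit-norm input
y = x + 0.1 * rng.standard_normal(d)                        # query correlated with x
ratios = []
for _ in range(300):
    A = rng.standard_normal((d, d))
    Pi, R = np.linalg.qr(A)
    Pi *= np.sign(np.diag(R))                                # Haar-uniform rotation
    x_tilde = Pi.T @ (np.sqrt(2 / (np.pi * d)) * np.sign(Pi @ x))   # 1-bit MSE reconstruction
    ratios.append(np.dot(y, x_tilde) / np.dot(y, x))
print(np.mean(ratios))                                       # ≈ 2/pi ≈ 0.64
```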
■ TAKEAWAY · NEXT §8 · we need a different kind of estimator
An MSE-optimal codebook minimises squared reconstruction error. The cost is a fixed scalar shrinkage on every inner product, and this shrinkage stays nonzero at any finite bit budget. Attention and nearest-neighbour search need an inner-product estimator whose mean is correct. Next: §8 keeps the same encoder and adds a fixed prefactor on the decoder side equal to the reciprocal of the shrinkage. The mean of many trials then equals $\langle \mathbf{q}, \mathbf{k}\rangle$.
§8 · QJL: the un-biaser
§7 ended with a shrunken reconstruction. The MSE codebook produces $\tilde{\mathbf{x}}$ values whose magnitudes are smaller than the inputs they encode, so every inner product $\langle \mathbf{y}, \tilde{\mathbf{x}}\rangle$ comes out smaller than $\langle \mathbf{y}, \mathbf{x}\rangle$ by the same scalar factor. At one bit per coordinate that factor is exactly $2/\pi$. Averaging over trials does not move the estimate toward $\langle \mathbf{y}, \mathbf{x}\rangle$, because the same scalar multiplies the result on every trial.
A deterministic scalar bias is removable without changing the encoder. Multiply the decoder's output by the reciprocal of the bias and the expectation of the product equals the unbiased target. QJL applies this idea at one bit per coordinate. The encoder discards magnitude information, which is the same step that shrank §7's reconstruction. The decoder applies a fixed prefactor whose value is the reciprocal of the half-normal shrinkage that sign quantization introduces.
Encoder. Sample one random Gaussian matrix $\mathbf{S}$ once and share it between every encoder and decoder. To store $\mathbf{x}$, write down the signs of $\mathbf{S}\mathbf{x}$. The stored object is one bit per coordinate; the magnitudes of the entries of $\mathbf{S}\mathbf{x}$ are discarded. Discarding the magnitudes produces the bit savings and also produces a $\sqrt{2/\pi}$ shrinkage on any reconstruction built from the signs alone, by the same half-normal identity that produced §7's $2/\pi$.
Decoder. A full-precision query $\mathbf{y}$ arrives. Compute $\langle \mathbf{S}\mathbf{y},\,\text{stored signs}\rangle$. This quantity is a noisy estimate of $\langle \mathbf{x},\mathbf{y}\rangle$ scaled down by $\sqrt{2/\pi}$. Multiply by $\sqrt{\pi/2}/d$. The factor $\sqrt{\pi/2}$ is the reciprocal of the half-normal shrinkage and cancels it in expectation; the factor $1/d$ averages the estimate over the $d$ rows of $\mathbf{S}$. The expected value of the result is $\langle \mathbf{x}, \mathbf{y}\rangle$. The per-trial variance is larger than the MSE estimator's variance, but the mean of many trials converges to $\langle \mathbf{x}, \mathbf{y}\rangle$.
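A sketch of that encoder/decoder pair (assuming, as in the math box further down, a unit-norm $\mathbf{x}$; for general inputs multiply the estimate by the stored $\|\mathbf{x}\|$):

```python
import numpy as np

d = 128
S = np.random.default_rng(3).standard_normal((d, d))   # one Gaussian matrix, shared by all

def qjl_encode(x):
    return np.sign(S @ x)                               # store 1 bit per coordinate

def qjl_estimate(y, signs, x_norm=1.0):
    # sqrt(pi/2) cancels the half-normal shrinkage; 1/d averages over the d rows of S;
    # x_norm restores the magnitude the encoder discarded (= 1 for unit-norm x)
    return (np.sqrt(np.pi / 2) / d) * x_norm * np.dot(S @ y, signs)
```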
■ HOW TO READ · same target, two estimators
Both panels use exactly 1 bit per coordinate. Left: the MSE-optimal codebook from §7, biased. Right: QJL with its calibration constant baked in. Each trial fires two shots (against independent $\mathbf{y}_1$ and $\mathbf{y}_2$). Same number of trials, same target. Watch where the red dot lands.
[Interactive demo — controls: dimension $d$, trials; panels: MSE-optimal · 1 bit (biased) and QJL · 1 bit (unbiased); readouts: MSE mean ratio, QJL mean ratio, truth = 1.0000]
What to notice. The MSE panel's red dot is southwest of the centre at the same offset as §7's 1-bit measurement, and that offset stays the same regardless of how many trials run. The QJL panel's red dot lands close to the centre but with a residual offset from finite-sample noise. QJL's per-trial variance is larger than MSE's (Lemma 4: $\propto \pi/(2d)$), so at the default trial count the residual offset is small but visible. The key difference between the two estimators is the source of this offset: MSE's offset is a fixed scalar bias on the inner product and does not shrink with more trials; QJL's residual offset is sampling noise around a correct mean and shrinks at the standard-error rate $1/\sqrt{n}$ as the trial count grows.
The math: definition and where $\sqrt{\pi/2}/d$ comes from
With $\mathbf{S}\in\mathbb{R}^{d\times d}$ i.i.d. $\mathcal{N}(0,1)$:
$Q_{\text{jl}}(\mathbf{x}) = \operatorname{sign}(\mathbf{S}\mathbf{x}) \in \{-1,+1\}^d, \quad \widehat{\langle \mathbf{x},\mathbf{y}\rangle} = \frac{\sqrt{\pi/2}}{d}\, \langle \mathbf{S}\mathbf{y},\,Q_{\text{jl}}(\mathbf{x})\rangle.$
Each row $\mathbf{s}_i$ makes $\mathbf{s}_i\mathbf{x}$ and $\mathbf{s}_i\mathbf{y}$ jointly Gaussian with covariance $\langle\mathbf{x},\mathbf{y}\rangle$. The half-normal identity gives $\mathbb{E}[(\mathbf{s}_i\mathbf{y})\,\text{sign}(\mathbf{s}_i\mathbf{x})] = \sqrt{2/\pi}\cdot\langle\mathbf{x},\mathbf{y}\rangle/\|\mathbf{x}\|$. Sum over $d$ rows and multiply by $\sqrt{\pi/2}/d$: the $\sqrt{2/\pi}$ shrinkage cancels, and the result is $\langle\mathbf{x},\mathbf{y}\rangle$ in expectation. Variance is bounded by $\tfrac{\pi}{2d}\|\mathbf{x}\|^2\|\mathbf{y}\|^2$ (Lemma 4 of the paper).
QJL by itself uses one bit per coordinate. TurboQuant-prod extends the construction to a $b$-bit budget by allocating the bits between the two estimators from §6 and §8. The first $b{-}1$ bits encode $\boldsymbol{\Pi}\mathbf{x}$ with the MSE codebook of §6 to capture magnitude. The last bit encodes the residual $\mathbf{r} = \boldsymbol{\Pi}\mathbf{x} - \tilde{\mathbf{y}}_{\text{mse}}$ with QJL to make the inner-product estimate unbiased. The total cost is $b\cdot d$ bits plus one scalar per vector (the residual norm $\|\mathbf{r}\|$), the same as TurboQuant-MSE.
The full TurboQuant-prod recipe
The residual norm is the only piece of side info in the whole scheme, one scalar per vector, not one per small block the way GPTQ, AWQ, or KIVI need. Variance is bounded by Theorem 2.
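As a sketch, the chain looks like this (reusing the §5 codebook for the first $b{-}1$ bits and the QJL estimator from above for the residual; the function split and names are mine, not the paper's):

```python
import numpy as np

def tq_prod_encode(x, Pi, S, mse_codebook):
    y = Pi @ x
    idx = np.argmin(np.abs(y[:, None] - mse_codebook), axis=1)   # (b-1)-bit MSE step
    r = y - mse_codebook[idx]                                    # residual in rotated space
    return idx, np.sign(S @ r), np.linalg.norm(r)                # 1-bit QJL signs + one scalar ||r||

def tq_prod_inner_product(q, Pi, S, mse_codebook, code):
    idx, signs, r_norm = code
    d = q.size
    qy = Pi @ q                                                  # rotate the query once
    mse_part = np.dot(qy, mse_codebook[idx])                     # inner product with the MSE part
    residual_part = (np.sqrt(np.pi / 2) / d) * r_norm * np.dot(S @ qy, signs)  # unbiased QJL term
    return mse_part + residual_part
```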
■ TAKEAWAY · NEXT §9 · two recipes, one budget
TurboQuant-MSE minimises reconstruction error and produces a biased inner-product estimate with a known shrinkage factor. TurboQuant-prod allocates one of its $b$ bits to a QJL residual and produces an unbiased inner-product estimate at higher per-trial variance. Both schemes use $b\cdot d$ bits plus one scalar per vector. Next: §9 compares both upper bounds against the information-theoretic lower bound.
§9 · Shannon's floor
The paper uses Shannon's lossy source-coding theorem (via Yao's minimax principle) to prove that no quantizer can do better than $D_{\text{mse}} \ge 4^{-b}$ on worst-case inputs on the unit sphere. The bound covers every conceivable quantizer, including randomized and data-adaptive ones. TurboQuant's matching upper bound is $\tfrac{\sqrt{3}\pi}{2}\cdot 4^{-b}$, within a factor of $\approx 2.7$ of the lower bound asymptotically and within a factor of $\approx 1.45$ at $b{=}1$.
[Interactive demo — controls: dimension $d$, samples per $b$; curves: TurboQuant-MSE measured, upper bound $\tfrac{\sqrt{3}\pi}{2}\cdot 4^{-b}$, Shannon lower bound $4^{-b}$]
The plot uses a log scale on the vertical axis. All three curves have the same slope (the $4^{-b}$ exponential rate) and differ only by a small constant offset.
Earlier data-oblivious quantizers (uniform rounding, scalar sketches) achieve a reconstruction error that decays only polynomially in the bit budget, e.g. $\mathcal{O}(1/b)$. TurboQuant's $4^{-b}$ rate is exponential in $b$. That exponential rate is what enables the $4$–$6\times$ KV-cache compressions reported in §10 without measurable downstream quality loss.
■ TAKEAWAY · NEXT §10 · tight against the floor
The upper bound, the lower bound, and the measured error all decay at the same exponential rate $4^{-b}$ as $b$ grows, and they differ from each other only by a small constant. TurboQuant therefore matches Shannon's $4^{-b}$ rate to within a factor of $\approx 2.72$ asymptotically and $\approx 1.45$ at $b{=}1$. Next: §10 looks at the systems consequences of this rate.
§10 · Why it matters
Needle-in-a-Haystack recall on Llama-3.1-8B-Instruct, every compressed method evaluated at a $4\times$ memory-compression target (paper Fig. 4):
| Method | NiaH |
|---|---|
| Full cache (FP16) | 0.997 |
| SnapKV | 0.858 |
| PyramidKV | 0.895 |
| KIVI | 0.981 |
| PolarQuant | 0.995 |
| TurboQuant | 0.997 |
TurboQuant matches the full-precision NiaH score at $4\times$ compression. On LongBench-V1 (paper Table 1), TurboQuant at $3.5$ bits per channel matches the full-precision average ($50.06$); at $2.5$ bits per channel it stays within $\approx 1\%$ of full precision ($49.44$ vs $50.06$), a $6.4\times$ compression.
Quantization time on 100K vectors, 4-bit quantization (paper Table 2):
| Method | d=200 | d=1536 | d=3072 |
|---|---|---|---|
| Product Quantization | 37.04 s | 239.75 s | 494.42 s |
| RaBitQ | 597.25 s | 2267.59 s | 3957.19 s |
| TurboQuant | 0.0007 s | 0.0013 s | 0.0021 s |
TurboQuant is between four and six orders of magnitude faster than the alternatives at 4-bit indexing, and the paper reports higher recall as well. The reason it is so fast is that the encoder is a fixed rotation followed by a lookup against a precomputed table. There is no codebook to learn from data and no per-block scales to fit at index time.
■ TAKEAWAY · the whole trick, in one line
After a random rotation, every coordinate of every input has the same fixed distribution: a low-variance Beta that converges to a Gaussian as $d$ grows. A single optimal codebook designed once for that distribution serves every input. The full vector-quantization problem reduces to the well-studied scalar quantization problem.