The most direct way to compress an embedding (other than quantization) is to fit PCA on the corpus and keep the top-d eigenvectors. It works, but PCA is a linear projection, and neural-network embeddings on the sphere are structurally nonlinear — the well-known cone effect in transformers. Some of the variance lives in a nonlinear tail that a linear decoder can’t reach.
This post is about a closed-form way to add a quadratic decoder on top of PCA, to capture part of that nonlinear tail. The encoder stays as plain PCA. The decoder is a degree-2 polynomial lift plus Ridge OLS (ordinary least squares with an L2 penalty), also closed-form. No SGD, no epochs, no hyperparameter search. One np.linalg.solve over corpus statistics.
The construction itself isn’t mine. “PCA encoder + quadratic decoder + least-squares fit” appears in the dynamical-systems literature under the name quadratic manifold (see Jain 2017, Geelen-Willcox 2022/2023, Schwerdtner-Peherstorfer 2024+ — more in §9). I only stumbled onto these papers after running my experiments and believing the construction was new. The polynomial lift doesn’t come up much in modern ML conversations, and this post is a note about a useful trick from an adjacent discipline carrying over to retrieval.
The concrete result. BEIR/FiQA, mxbai-embed-large-v1 (1024d), per-vector budget 512 bytes. Metric is NDCG@10 (Normalized Discounted Cumulative Gain over the top-10, the standard retrieval ranking-quality measure; range [0, 1], higher is better):
| method | NDCG@10 | Δ vs raw 1024d |
|---|---|---|
| raw 1024d (4096 bytes) | 0.4525 | — |
| PCA top-256 | 0.4168 | -3.58 p.p. |
| poly-AE 256d | 0.4441 | -0.85 p.p. |
| matryoshka top-256 | 0.4039 | -4.86 p.p. |
PCA already gives 4× per-vector memory compression at -3.58 p.p. NDCG. A quadratic decoder on top of PCA pulls another +2.73 p.p. — closing almost the entire gap to raw, at the same byte budget. Matryoshka is in the table as another familiar baseline (here it drops more than PCA — a known side observation, not the central claim of this post; see §3 footnote and §4).
Measured on four models (nomic-v1.5, mxbai-large, bge-base, e5-base). Poly-AE over PCA: +1 to +4.4 p.p. at d=128, +0.03 to +2.7 p.p. at d=256. Full table in §3.
Full implementation — ~150 lines of numpy, MIT, repository github.com/IvanPleshkov/poly-autoencoder. The BEIR eval script is beir_eval.py in the same repo. Reproduces in 30–40 minutes on an M-series MacBook (10–15 min to encode the corpus, ~15 min for the Ridge solve at d=256).
Four lines in every measurement:
- raw: the full D-dimensional embedding, no compression; the reference point.
- matryoshka: embedding[:d] plus L2 normalization. On models trained with a matryoshka loss (nomic, mxbai in our sample), this is a valid matryoshka vector. On models without matryoshka training (bge-base, e5-base) it's a test of what happens when you naively slice a non-MRL model — the scenario users of the bge family, e5, and custom-fine-tuned embeddings actually fall into.
- PCA: project onto the top-d principal components and retrieve on the resulting d-dimensional code.
- poly-AE: encode with the same PCA to p ∈ ℝ^d, decode with a quadratic polynomial back to a full D-dimensional V̂, retrieve on V̂.

At a fixed d, all four methods store 2d bytes per vector (fp16 coordinates).
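To make the matryoshka row and the shared byte budget concrete, here is a minimal numpy sketch (slice_and_normalize and the random stand-in corpus are illustrative, not from the repo):

```python
import numpy as np

def slice_and_normalize(V, d):
    """Naive matryoshka-style truncation: keep the first d coordinates
    and L2-normalize again (this is what embedding[:d] means above)."""
    p = V[:, :d]
    return p / np.linalg.norm(p, axis=1, keepdims=True)

# At a fixed d, every method stores d fp16 coordinates = 2*d bytes per vector.
V = np.random.randn(1000, 1024).astype(np.float32)       # stand-in corpus
V /= np.linalg.norm(V, axis=1, keepdims=True)
p = slice_and_normalize(V, 256).astype(np.float16)
print(p.shape, p.nbytes // len(p), "bytes per vector")    # (1000, 256) 512 bytes per vector
```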
BEIR is the standard set of retrieval datasets (SciFact, FiQA, NFCorpus, TREC-COVID, etc.). The metric is NDCG@10. Corpus and queries are encoded with the chosen model, top-10 are retrieved by cosine similarity, and NDCG is computed against the labeled qrels.
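For reference, the scoring step can be written in a few lines of numpy. This is a simplified sketch; the repo's beir_eval.py uses the standard BEIR loaders and qrels format instead, and the function name, the dense matrices, and the integer-keyed qrels dict here are illustrative assumptions:

```python
import numpy as np

def ndcg_at_10(corpus, queries, qrels):
    """Cosine top-10 retrieval + NDCG@10 averaged over queries.
    corpus (N, D) and queries (Q, D) are L2-normalized;
    qrels maps query index -> {doc index: graded relevance}."""
    scores = queries @ corpus.T                      # cosine similarity
    top10 = np.argsort(-scores, axis=1)[:, :10]
    disc = 1.0 / np.log2(np.arange(2, 12))           # 1 / log2(rank + 1), ranks 1..10
    out = []
    for qi, docs in enumerate(top10):
        rels = qrels.get(qi, {})
        gains = np.array([rels.get(int(di), 0) for di in docs], dtype=float)
        dcg = float((gains * disc).sum())
        ideal = np.sort(np.fromiter(rels.values(), dtype=float))[::-1][:10]
        idcg = float((ideal * disc[:len(ideal)]).sum())
        out.append(dcg / idcg if idcg > 0 else 0.0)
    return float(np.mean(out))
```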
PCA and poly-AE are fit transductively: statistics are computed on the corpus we want to compress. Queries never participate in the fit — they hit a fixed encoder/decoder at inference time. This matches a production deployment: an index operator computes PCA + Ridge once on their data and then serves queries.
For the main runs we use FiQA — 57K documents, 648 queries, 1706 qrels.
NDCG@10 on FiQA at budgets of 256 fp16 (512 bytes/vector) and 128 fp16 (256 bytes):
| Model | D | d | raw | matryoshka† | PCA | poly-AE | poly over matryoshka | poly vs raw |
|---|---|---|---|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | 128 | 0.3746 | 0.3190 | 0.3273 | 0.3380 | +1.90 p.p. | -3.65 p.p. |
| nomic-embed-text-v1.5 | 768 | 256 | 0.3746 | 0.3508 | 0.3670 | 0.3673 | +1.65 p.p. | -0.73 p.p. |
| mxbai-embed-large-v1 | 1024 | 128 | 0.4525 | 0.3503 | 0.3689 | 0.4129 | +6.26 p.p. | -3.97 p.p. |
| mxbai-embed-large-v1 | 1024 | 256 | 0.4525 | 0.4039 | 0.4168 | 0.4441 | +4.02 p.p. | -0.85 p.p. |
| bge-base-en-v1.5* | 768 | 128 | 0.4062 | 0.2914 | 0.3266 | 0.3654 | +7.40 p.p. | -4.09 p.p. |
| bge-base-en-v1.5* | 768 | 256 | 0.4062 | 0.3574 | 0.3688 | 0.3958 | +3.84 p.p. | -1.05 p.p. |
| e5-base-v2* | 768 | 128 | 0.3987 | 0.2498 | 0.3065 | 0.3317 | +8.18 p.p. | -6.70 p.p. |
| e5-base-v2* | 768 | 256 | 0.3987 | 0.3333 | 0.3618 | 0.3852 | +5.19 p.p. | -1.35 p.p. |
† The “matryoshka” column is embedding[:d] plus L2 normalization. On nomic and mxbai it’s a valid matryoshka vector. On models marked with * (bge-base, e5-base) the model wasn’t trained for matryoshka, and the slice here is a test of what happens when you naively slice a non-MRL model. This is the scenario that users of the bge family, e5, and custom-fine-tuned embeddings actually fall into — and we measure it here honestly.
What this shows:
- On non-MRL models the matryoshka slice drops more than PCA at every d — discussed in §4.
- At d=256, poly-AE loses 0.7–1.4 p.p. NDCG vs raw 768/1024 on all four models. 4× per-vector memory compression for less than 1.5 p.p. lost — the main number of the post.

PCA is the linear baseline. The quadratic decoder adds the nonlinear piece that the linear one can't reach (mechanics in §5). How much does that actually help on retrieval, and at what d?
Poly-AE lift over PCA, by model and d:
| Model | poly over PCA, d=128 | poly over PCA, d=256 |
|---|---|---|
| nomic-v1.5 | +1.07 p.p. | +0.03 p.p. |
| mxbai-large | +4.40 p.p. | +2.73 p.p. |
| bge-base | +3.88 p.p. | +2.70 p.p. |
| e5-base-v2 | +2.52 p.p. | +2.34 p.p. |
The picture:
At d=128 (8× compression) poly is consistently 1–4 p.p. ahead of PCA. This is the regime where the linear decoder starts dropping noticeable variance into the nonlinear tail, and the quadratic correction pulls it back. Sweet spot for the method.
At d=256 (4× compression) the gap is uneven. On mxbai/bge/e5 — a stable +2.3–2.7 p.p. On nomic — close to zero (+0.03). Likely reason: nomic was carefully trained with multi-slice contrastive loss, its latent is more isotropic, and at d=256 the linear projection already takes most of what’s there. On non-MRL models the nonlinear tail is bigger → poly helps more.
More anisotropy → bigger lift. The stronger the cone effect, the more variance lives in the nonlinear tail that PCA can’t reach but poly can. That’s the geometry §5 unpacks.
In §3 you can see that on non-MRL models (bge, e5) the matryoshka column drops more than PCA — i.e. on a random non-MRL-trained model, naive slicing works worse than a corpus-side linear projection. This is a known result; the "MRL vs PCA on retrieval" question has been discussed independently of this post — see Matryoshka-Adaptor 2024, SMEC "Rethinking MRL" 2025, CoRECT 2025, and the YouTube video literally titled "Is PCA enough?". This post compares PCA and poly-AE; matryoshka is in the §3 table as a third reference point.
A corpus-side PCA fit is a required part of the method. That means poly-AE doesn’t work when the corpus isn’t available:
In those settings you want an MRL-trained model and embedding[:d], and poly-AE isn’t an alternative — it also needs corpus statistics.
In the operator-fit setting (fixed corpus, the operator fits compression once), you have two working modes:
PCA is the best possible linear projection of data into a d-dimensional subspace. But “best linear” doesn’t mean “good enough”: if the data has nonlinear structure, a linear decoder can’t reach it, period.
Transformer embeddings have exactly this kind of structure, and it is well studied — the cone effect: the point cloud is concentrated inside a narrow cone on the unit sphere and is heavily nonlinearly structured inside that cone.
Left: isotropic data. Right: one example of what anisotropic data can look like.
PCA only catches projections along orthogonal eigenvectors — i.e. the linear ellipsoid that the cloud doesn’t actually fit in. Whatever lives in the curvature of the manifold is structurally invisible to a linear decoder. The stronger the anisotropy (the narrower the cone), the more variance sits in this nonlinear tail that PCA structurally can’t reach.
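A quick way to gauge the effect on your own corpus before deciding whether the quadratic correction is worth it: two rough anisotropy indicators, the mean pairwise cosine (how narrow the cone is) and the covariance condition number (how unevenly variance is spread across directions). The helper below is illustrative, not from the repo:

```python
import numpy as np

def anisotropy_report(V, sample=2000, seed=0):
    """V: (N, D) L2-normalized embeddings.
    Returns mean pairwise cosine (near 0 for isotropic data, close to 1 inside
    a narrow cone) and the condition number of the covariance matrix."""
    rng = np.random.default_rng(seed)
    S = V[rng.choice(len(V), size=min(sample, len(V)), replace=False)]
    sims = S @ S.T
    mean_cos = (sims.sum() - np.trace(sims)) / (len(S) * (len(S) - 1))
    eig = np.linalg.eigvalsh(np.cov(V, rowvar=False))
    cond = eig[-1] / max(eig[0], 1e-12)
    return float(mean_cos), float(cond)
```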
So what we want is clear. A decoder that can deal in quadratic combinations of coordinates — i.e. a decoder that captures local curvature of the manifold. Then we’d recover some of the information that PCA loses.
§5 made it clear that we need a nonlinear decoder. The straightforward route is to train a neural network, but then we lose the closed-form pipeline and pick up SGD, a learning rate, a batch size, convergence checks, early stopping. We'd like a nonlinear decoder solvable by a single formula.
Standard regression trick: the polynomial lift. Take vector p and lift it into all monomials up to degree 2 — bias, linear terms, squares, and pairwise products. On a 2D example:
lift([p₁, p₂]) = [1, p₁, p₂, p₁², p₁·p₂, p₂²]
↑ ↑ ↑ ↑ ↑ ↑
bias linear quadratic
Any linear combination of these six = a quadratic function of the original p₁, p₂. So a linear regression on the lift = a quadratic regression in the original space. No nonlinear optimization needed, plain Ridge OLS in closed form does it.
Apply this to our case. The encoder is top-d PCA (p = (V − V̄) @ Q, p ∈ ℝ^d). Lift p into all monomials up to degree 2:
import numpy as np

def polynomial_lift(p, degree=2):
    """Lift a single latent vector p into all monomials up to degree 2."""
    features = [1.0]                      # bias
    features.extend(p)                    # degree 1
    for i in range(len(p)):
        for j in range(i, len(p)):
            features.append(p[i] * p[j])  # degree 2: squares and pairwise products
    return np.array(features)
The lift dimension is M = (d+1)(d+2)/2. For d=128 that’s M = 8385, for d=256 it’s M = 33153. The decoder is a linear regression from the M-dimensional lift back to the D-dimensional original. Despite being “just a linear regression”, the resulting map p → V̂ is quadratic — exactly what we needed.
Training the decoder is one np.linalg.solve:
def fit_decoder(P, V, lam=1e-3):
    # P: (N, d) PCA codes, V: (N, D) original embeddings to reconstruct
    L = np.stack([polynomial_lift(p) for p in P])              # (N, M) lifted features
    G = L.T @ L                                                 # (M, M) Gram matrix
    G += lam * np.trace(G) / G.shape[0] * np.eye(G.shape[0])    # Ridge: lambda scaled by the mean diagonal
    W = np.linalg.solve(G, L.T @ V)                             # (M, D) decoder weights
    return W
The polynomial autoencoder is assembled:
- Encoder: top-d PCA, p = (V − V̄) @ Q.
- Decoder fit: one np.linalg.solve (fit_decoder above).
- Decode: V̂ = polynomial_lift(p) @ W.
- Residual: V_resid = V − V̂ (useful in §8 for quantization).

No backprop, no learning rate, no batch size, no epochs. On 100K vectors at d=100, the fit takes a couple of minutes on CPU. At d=256, around 15 minutes (Ridge solves an M×M system, so the cost is cubic in M).
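Putting the pieces together, a minimal end-to-end sketch that reuses polynomial_lift and fit_decoder from above. The repo version additionally applies the per-axis normalization described in the next section; fit_poly_ae, encode, and decode are illustrative names:

```python
import numpy as np

def fit_poly_ae(V, d, lam=1e-3):
    """Fit the PCA encoder and quadratic decoder on a corpus V of shape (N, D)."""
    mean = V.mean(axis=0)
    _, _, Vt = np.linalg.svd(V - mean, full_matrices=False)  # PCA via SVD
    Q = Vt[:d].T                      # (D, d): top-d principal directions
    P = (V - mean) @ Q                # (N, d): latent codes for the whole corpus
    W = fit_decoder(P, V, lam=lam)    # (M, D): quadratic decoder weights
    return mean, Q, W

def encode(V, mean, Q):
    return (V - mean) @ Q             # p in R^d

def decode(P, W):
    L = np.stack([polynomial_lift(p) for p in P])
    return L @ W                      # V_hat in R^D

# V_hat = decode(encode(V, mean, Q), W); V_resid = V - V_hat  (used in §8)
```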
Raw p after PCA has huge anisotropy across its coordinates (the first axis has ~50× the variance of the last) — bad for Ridge: large axes dominate the solution, small ones get under-weighted. The fix is per-axis std normalization:

p_normed = p / np.sqrt(eigvals)   # eigvals: the top-d PCA eigenvalues (per-axis variance of p)
p = p_normed * (0.9 / np.linalg.norm(p_normed, axis=1).max())
The global 0.9 / max-norm factor gives ‖p‖ ≤ 0.9 — numerical stability under squaring. Doesn’t affect NDCG, but cleaner numerically.
A methodological note. On the first sanity check on SciFact (5183 docs, 300 queries), poly-AE at d=256 showed NDCG@10 0.6980 vs raw 0.7032 — losing only 0.5 p.p. The reported gain over PCA there was +1.36 p.p.
On FiQA (57638 docs), the same experiment gave:
What happened. On SciFact at d=256 Ridge solves a regression with M=33153 features on N=5183 vectors — severely underdetermined (far more parameters than training vectors): the in-sample V̂_corpus matches V_corpus almost exactly, so retrieval on V̂_corpus is equivalent to retrieval on the full 768-d V_corpus. The SciFact poly-AE measurement inflated its gap with PCA because it memorized the corpus in the Ridge weights, not because it generalized better.
What’s important: the poly-vs-raw drop is stable (-0.5 to -0.85 p.p.) across both corpora. But for poly-vs-PCA, on small corpora (N ≤ 10K at d=256) the gap is partly illusory. A safe rule of thumb: N should be at least 5–10× larger than M = (d+1)(d+2)/2. For d=256 that’s N ≳ 200K. For d=128 it’s N ≳ 50K.
In the headline table (§3), FiQA’s N = 57K is enough for d=128 and borderline for d=256. The bge numbers on FiQA at d=256 should be read with that caveat.
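The rule of thumb is cheap to check before running anything. A small helper, using the M = (d+1)(d+2)/2 monomial count from §6; the function names and the 5× default factor are illustrative:

```python
from math import comb

def lift_dim(d, degree=2):
    """Number of monomials up to `degree` in d variables, including the bias term."""
    return comb(d + degree, degree)

def corpus_is_large_enough(N, d, degree=2, factor=5):
    """Conservative check: want N at least `factor` times the lift dimension M."""
    return N >= factor * lift_dim(d, degree)

print(lift_dim(128), lift_dim(256))            # 8385 33153
print(corpus_is_large_enough(57_638, 128))     # True  (FiQA is fine at d=128)
print(corpus_is_large_enough(57_638, 256))     # False (borderline at d=256)
```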
The autoencoder doesn’t compress anything by itself, strictly speaking. It repackages a vector into a pair (p, V_resid) — latent code plus residual. Same information, different shape. To get actual compression, the residual has to be quantized.
This is where poly-AE’s anisotropy-removing property matters. On DBpedia-OpenAI at d=100, cond(V_resid) drops from 72 to 3.4 — almost isotropic residual. Linear PCA at the same d gives cond=4.21 — noticeably less isotropic.
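For reproducing this kind of number, a minimal sketch of one reasonable definition: the ratio of the largest to the smallest singular value of the centered residual matrix (the exact definition behind the figures above may differ):

```python
import numpy as np

def cond_number(X):
    """Ratio of largest to smallest singular value of the centered matrix X (N, D)."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return float(s[0] / s[-1])

# compare cond_number(V_resid) against cond_number(V - V_pca_reconstruction)
```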
Google Research recently published TurboQuant — a quantizer with a fixed random rotation that on isotropic distributions gets close to the theoretical compression limit (Shannon rate-distortion for normals). The composition is natural: the autoencoder strips out the anisotropy, TurboQuant efficiently quantizes the now-isotropic residual.
V ──► encoder (PCA) ──► p
                        │
                        ▼
           decoder (poly+Ridge) ──► V̂
                                    │
                                    ▼
V − V̂ ──► V_resid ──► TurboQuant ──► compact code
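TurboQuant itself is out of scope here, but the shape of the last arrow can be illustrated with a generic stand-in: a fixed random rotation followed by per-coordinate int8 quantization. This is not TurboQuant, just a sketch of what quantizing the now-isotropic residual means:

```python
import numpy as np

def quantize_residual(V_resid, seed=0):
    """Toy residual quantizer: fixed random rotation + symmetric int8 codes."""
    rng = np.random.default_rng(seed)
    D = V_resid.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((D, D)))   # fixed random rotation
    X = V_resid @ R
    scale = np.abs(X).max() / 127.0
    codes = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return codes, scale, R

def dequantize_residual(codes, scale, R):
    return (codes.astype(np.float32) * scale) @ R.T    # back to the original basis
```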
For Qdrant 1.18 I shipped a TurboQuant modification with anisotropy compensation. It achieves roughly the same recall as the “poly-AE + residual quantization” hybrid, but without needing to store the poly-AE latent coordinates — saving the p-side of the per-vector byte budget. The tricks I layered on top of standard TurboQuant will be in a separate post.
I’ve been familiar with the polynomial lift since my time at ACD/Labs — in cheminformatics it’s a standard tool for building features from molecular structure (as input for downstream regression). Currently I work at Qdrant. When I was implementing late-interaction retrieval support there (ColBERT-style multi-vectors with maxsim, v1.10), I remembered the trick — and that’s how the “PCA + quadratic decoder” idea came together. I thought no one had tried it for neural embeddings yet.
When the draft went to Sava Kalbachou for review, he pointed out specific papers where this exact construction has long been used — in numerical modeling of physical systems, under the name quadratic manifold:
In those papers the construction is applied to solution-trajectory snapshots of physical-system simulations. Embeddings are a different data geometry, and it’s not obvious in advance that quadratic manifold should also work there; the numbers in §3 show that it does.
As to why the trick hasn't crossed over into ML, these are just speculations. The communities probably don't overlap much: quadratic-manifold work appears in engineering journals and rarely surfaces in ML research.
Also, in ML’s collective memory polynomial methods are mostly associated with the kernel trick (SVM, kernel PCA) — a different construction, and the explicit polynomial lift as a stand-alone technique rarely comes up after that.
And an “autoencoder” is usually reflexively understood as a neural network in modern ML; closed-form variants may simply fall outside the usual field of attention.
So this is an applied-ML post: a technique developed in an adjacent field turns out to be useful for embeddings as well. The contribution of the post is the port and the empirical check.
A few honest caveats:
- At d=256, M = 33K, and np.linalg.solve in closed form costs O(M³) ≈ 5–15 minutes on CPU. At d=384 that's already M = 74K and M³ is unmanageable. The method is comfortable up to d ≈ 200–256. For larger d you'd need randomized methods (random feature approximation), or move to a neural decoder.
- When N < 5M (i.e. N < 200K at d=256), Ridge starts memorizing the corpus, and in-sample retrieval improves due to overfitting rather than due to structure. The signal: a poly-vs-PCA gap that's too large probably means overfit.
- A degree-3 lift gives M = O(d³) — for d=100 that's 175K features. It fits in memory, but the Ridge solve is impractical and overfitting climbs.

Obvious branches:
- At what d/N does degree 3 start to actually help? Are there embeddings with significant 3rd-order tensor structure that we're losing now?

Loaders for BEIR (SciFact, FiQA, NFCorpus, TREC-COVID), encoders (nomic-v1.5, mxbai-large, bge-base, e5-base), PCA-init with per-axis std, polynomial lift, Ridge OLS, retrieval with NDCG@10/Recall@10 — all in github.com/IvanPleshkov/poly-autoencoder.
Minimal run:
git clone https://github.com/IvanPleshkov/poly-autoencoder.git
cd poly-autoencoder
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt pytrec_eval einops
python beir_eval.py --model nomic-embed-text-v1.5 --dataset fiqa --d 128,256
On an M-series MacBook the first run downloads the model and dataset, encodes the corpus (5–15 min depending on the model), and over the next 5–20 minutes prints a table with NDCG@10 for all four methods.