> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
FlashAttention-2 hasn't been used for at least 2y.
This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.
If the results persist from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true.
With back-of-the-napkin math from inside my head, that'd be like 0.5 to 1 million LOC, depending on language/code density. You could just fold the entire codebase into one prompt if it's a small one, that'd be neat :)
I get why they aren't disclosing all the details, but it seems more hype-train-esque to me at the moment. I don't disagree that this could be big.
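A quick sanity check of that napkin math, as a sketch only. It assumes the 12M-token window and somewhere between 12 and 24 tokens per line of code; those per-line figures are guesses, not measurements, and `lines_that_fit` is just an illustrative helper name.

```python
# Rough estimate of how many lines of code fit in a context window.
# The tokens-per-line values are assumptions, not measured figures.

def lines_that_fit(context_tokens: int, tokens_per_line: int) -> int:
    """Return the approximate number of source lines that fit in the window."""
    return context_tokens // tokens_per_line


CONTEXT_TOKENS = 12_000_000  # the 12M window mentioned in the announcement

for tokens_per_line in (12, 24):
    loc = lines_that_fit(CONTEXT_TOKENS, tokens_per_line)
    print(f"{tokens_per_line} tokens/line -> ~{loc:,} LOC")
```

Real ratios depend heavily on the tokenizer and the language, so treat the range as an order-of-magnitude figure.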
- guided window attn. Predict where to attend to but in a fixed window. If you do this to just the token/vocab you can keep effectively unlimited context and perfect recall. (yes, I can do that. There is a trick to teaching it how to predict position. This also immediately opens other crazy things like NN memory)
- efficient fixed state size models. So not a recurrent mechanism because that breaks training, parallelizable like transformers, but fixed sized state instead of unbounded attn. Pick a reasonable amount of state and it is amazingly good since it doesn't need to keep separating wheat from chaff in context (yes, it is possible to build this, I have. It works. This also opens up real streamed models. I have a true infinite context streamed model I toy with locally that I am getting to be audio/text in and audio/text out in real time.)
Put those together and you have O(1) token gen, infinite context and perfect recall. It is a whole new world of models. You can interact with a model until you have it at the state you want and then save its state and use that as if it were your system prompt. Batches pack perfectly so inference is massively more efficient. Training is massively more efficient. Transformer and unlimited attn models are a dead end. But how do you make money on this as an independent researcher? If I release the Two Weird Tricks this is all based on I get zip and the big players get even more tech for free. If I keep it all secret I get Zip and eventually the tricks will be figured out. (Yes a little frustration here) If anyone wants the model architecture of the future make me an offer :)
My guess is that they're angling for an acquisition.
Because the cost to OpenAI to make an architectural shift is far greater than the cost to a new lab to try something different, providing details is usually a net benefit for recruiting, building trust, getting acquired, etc. The lack of details is a poor business decision because it makes them seem untrustworthy.
I'm not advocating that they should open source their model, but there is already so much noise in the space and many bad papers that being cagey is a poor strategy for winning over talent, developers, etc.
This is what I've thought was going to happen ever since they publicized their efforts. They probably don't have the money to train large models themselves, might as well get a nice chunk of change by being acquired by someone who already has said large models running.
Local inference is insanely fast on my M4 Pro MBP though, so I can understand where you're coming from, but I don't need it too much faster. I still need time to review, test, review and provide feedback to the model. Fast is okay I guess for true vibe coding.
OpenAI validating it can still happen faster than they can get the compute to serve the models themselves[1]. It doesn't make a lot of sense to give out details if they want to be a serious contender or even as some have said, be acquired.
Yeah there's noise but if they have the real deal then it doesn't matter. The only thing they need to do is let people pay to use the models.
[1] I'm assuming this is the primary cause of the delay. That may not be the case of course.
Is it, though? This scrappy startup was able to take a large(-ish) open weights model and adapt it. Why can't the frontier labs do the same cost effectively?
>If they want to get acquired, then they should show that they know what they're doing.
I'm sure they would do so under an appropriate NDA as part of negotiations. I'm not sure why you think a full public disclosure is necessary.
Date: June 16, 2026
The hardest enterprise AI problems share a common shape. They require reasoning over complete artifacts: entire codebases, document collections, contracts, financial filings.
For years, the industry worked around this problem by building retrieval pipelines, chunking strategies, and agentic scaffolding — useful tools, but ultimately workarounds for context limitations of the model architecture. The underlying constraint was attention: compute that scales quadratically with context length, making direct reasoning over large artifacts prohibitively expensive.
SubQ is built to remove that constraint. Today we're releasing the model card for SubQ 1.1 Small — the second iteration of our Subquadratic Sparse Attention (SSA) model, at the smallest size. We are in the process of deploying SubQ 1.1 Small with select design partners and plan to deploy a broader lineup of models ranging from 2M to 12M tokens later in the year.
These results reflect the scaling advantage that SSA's efficiency gains make possible.
SubQ 1.1 Small was evaluated across five axes, covering long-context retrieval, context-length generalization, knowledge, coding, and long-horizon agentic tasks.
We selected Needle-In-A-Haystack (NIAH) and Nvidia's RULER test because together they test whether the model can find a single fact buried deep in a large context, and whether it can connect the dots across that context.
NIAH is the precision test. It places one retrievable fact at a controlled depth within a long context and asks the model to return it exactly. SubQ 1.1 Small scores 100% at 1M and 2M tokens and 98% at 6M and 12M. The model was trained predominantly at 1M tokens, yet retrieval held up at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.
RULER is the capability test. Its 13 tasks go beyond single-fact lookup to cover multi-hop variable tracing, frequency extraction, and aggregation across the full context, the kind of reasoning complete-artifact workloads actually require. SubQ 1.1 Small scores 99.12% at 128K.
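To make the NIAH setup concrete, here is a minimal sketch of how such a test prompt can be assembled. It is an illustration under simplifying assumptions, not the harness used for these results: whitespace-separated words stand in for tokens, and the function names, filler text, and needle are all hypothetical.

```python
# Minimal needle-in-a-haystack prompt builder (illustrative only).
# Words stand in for tokens; a real harness would use the model's tokenizer.

def build_niah_prompt(filler: str, needle: str, question: str,
                      context_words: int, depth: float) -> str:
    """Insert `needle` at a fractional `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    filler_words = filler.split()
    # Repeat the filler until it covers the requested context size.
    repeats = context_words // len(filler_words) + 1
    haystack = (filler_words * repeats)[:context_words]
    position = int(depth * len(haystack))
    words = haystack[:position] + needle.split() + haystack[position:]
    return " ".join(words) + "\n\n" + question


def is_exact_match(model_answer: str, expected: str) -> bool:
    """Score a response: the expected fact must appear verbatim."""
    return expected in model_answer


prompt = build_niah_prompt(
    filler="The quick brown fox jumps over the lazy dog.",
    needle="The secret code is 7421.",
    question="What is the secret code?",
    context_words=1000,
    depth=0.5,
)
```

Sweeping `depth` and `context_words` over a grid gives the usual depth-by-length picture of where retrieval starts to fail.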
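As an illustration of what multi-hop variable tracing asks of a model, the sketch below builds a small chain of assignments and resolves it programmatically to get the ground truth. The statement format, variable names, and helper functions are assumptions made for this example, not RULER's actual generator.

```python
# Toy multi-hop variable-tracing task (illustrative; not RULER's own code).

def make_chain(names: list[str], value: int) -> list[str]:
    """First variable gets a literal value; each later one copies the previous."""
    statements = [f"VAR {names[0]} = {value}"]
    for prev, cur in zip(names, names[1:]):
        statements.append(f"VAR {cur} = VAR {prev}")
    return statements


def resolve(statements: list[str]) -> dict[str, int]:
    """Follow every assignment back to its literal value (the ground truth)."""
    direct = {}
    for statement in statements:
        _, name, _, *rhs = statement.split()
        direct[name] = rhs[-1]  # a literal, or the name of another variable

    def value_of(name: str) -> int:
        current = direct[name]
        while current in direct:  # keep hopping until we reach a literal
            current = direct[current]
        return int(current)

    return {name: value_of(name) for name in direct}


statements = make_chain(["A", "B", "C", "D"], 12345)
ground_truth = resolve(statements)
```

In a long-context setting the statements would be scattered through unrelated text, so answering "which variables hold 12345?" requires connecting facts that sit far apart.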
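| Benchmark | Context length | Score |
|---|---|---|
| Multi-task retrieval: RULER | 128K | 99.12% |
| Single-fact retrieval: Needle-in-a-haystack | 1M | 100% |
| Single-fact retrieval: Needle-in-a-haystack | 2M | 100% |
| Single-fact retrieval: Needle-in-a-haystack | 6M | 98% |
| Single-fact retrieval: Needle-in-a-haystack | 12M | 98% |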
SubQ 1.1 Small pairs long-context optimization with strong general reasoning ability. GPQA Diamond at 85.4% sits just below mid-tier frontier models and well above the smaller tier. LiveCodeBench at 89.7% pass@4 is close to the absolute frontier. AutomationBench Finance at 13% places SubQ 1.1 Small close to the strongest models on that benchmark, ahead of mid-tier and smaller baselines. Absolute scores remain low across all models on this benchmark.
| Benchmark | SubQ 1.1 Small | | | | | | |
|---|---|---|---|---|---|---|---|
| Graduate-level science: GPQA Diamond · pass@1 | 85.4 | 93.2 | 92 | 87.5 | 87.5 | 81.7 | 67.2 |
| Agentic finance: AutomationBench | 13% | 18% | 16% | 8% | 0% | n/r | 3% |
| Competitive programming: LiveCodeBench v6 · pass@4 | 89.7 | 92 | 92.2 | 88.9 | 78.6 | 78.2 | 69.7 |
n/r = result not reported by the model provider
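For readers unfamiliar with the pass@k notation in the table, the sketch below shows the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), computed per problem from n samples of which c pass. This post does not state how its pass@4 figure was computed, so treat this as background on the metric rather than a description of the evaluation here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which makes the estimate 1.0:
    # any k draws must then include at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for a problem, 3 of them pass.
single_attempt = pass_at_k(n=10, c=3, k=1)
four_attempts = pass_at_k(n=10, c=3, k=4)
```

A benchmark-level score is the mean of this value over all problems, which is why pass@4 is always at least as high as pass@1 on the same samples.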
SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length. SSA's advantage over dense attention grows as context length increases. At 1M tokens, SubQ requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2 on a single attention layer. In practice, this drastically changes the economics of long-context training and inference.
A full breakdown of the mechanism and how it compares to FlashAttention, DeepSeek sparse attention, and recurrent architectures is in the Technical Report.
SubQ uses 64.5x less compute than dense attention, and is 56× faster than FlashAttention-2 at 1M-token context
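The scaling argument can be sketched with a toy cost model. This is not the SSA mechanism: it assumes each query attends to a fixed budget of `k` keys and ignores the cost of selecting them. The budget `K` below is hypothetical, back-solved so the toy model matches the reported 64.5x at 1M tokens; it is not a published SSA parameter.

```python
# Toy cost model: dense attention scores every query against every key
# (n * n pairs); a sparse scheme scores each query against a fixed budget
# of k selected keys (n * k pairs). The ratio is therefore n / k.

def dense_pairs(n: int) -> int:
    """Query-key pairs scored by dense attention over n tokens."""
    return n * n


def sparse_pairs(n: int, k: int) -> int:
    """Query-key pairs scored when each query sees at most k keys."""
    return n * min(k, n)


def compute_ratio(n: int, k: int) -> float:
    """How many times more pairs dense attention scores than the sparse scheme."""
    return dense_pairs(n) / sparse_pairs(n, k)


# Hypothetical budget, chosen only so the toy model reproduces 64.5x at 1M.
K = round(1_000_000 / 64.5)

for n in (128_000, 1_000_000, 12_000_000):
    print(f"n={n:>10,}  dense/sparse ~ {compute_ratio(n, K):.1f}x")
```

Under these assumptions the advantage grows in proportion to context length, which matches the qualitative claim above. The values the toy model prints for other lengths are extrapolations, not reported results.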
We started with an existing open-weight frontier model, replaced dense attention with SSA, and built long-context capability through staged context extension (262K, 512K, 1M, 2M) followed by roughly one trillion tokens of continued pretraining on naturally long artifacts: books, documents, and repository-scale code.
The strongest lever we found for improving long-context retrieval was long-context continued pretraining, made possible by the efficiency of the SSA algorithm. The 12M generalization result reflects both factors: SSA's selection criterion is independent of absolute position, and the capability to use that generalization reliably develops through training on long data.
Additionally, we ran more than one hundred experiments across six to seven model generations to get the balance of capabilities between long- and short-context tasks right. That kind of iteration is only possible because SSA enabled our team to run multi-million-token experiments as a standard procedure rather than a rare event, making the research loop more efficient.
SubQ is designed for workloads that require reasoning over information distributed across the artifact without fragmentation. Here are just a few of the use cases from our initial research:
We'll be kicking off with the first cohort of design partners in the next few weeks, with broader rollout through the quarter and general model releases by end of year.