For example, Sketchy Provider tells you they are running the latest and greatest, but is actually, knowingly, running some cheaper (and worse) model and pocketing the difference. These tests wouldn't help, since Sketchy Provider could detect when they're being tested and do the right thing only then (like the Volkswagen emissions scandal). Right?
Kimi K2.6, however, is the new open-source leader so far. Agentic evaluations are still in progress, but one-shot coding/reasoning benchmarks are ready at https://gertlabs.com/?mode=oneshot_coding
If someone actually goes out of their way to bypass the check, that's a pretty different situation legally compared to just quietly shipping a cheaper quant anyway.
For a truly malicious actor, you're right. But it shifts it from "well we aren't obviously committing fraud by quantizing this model and not telling people" to "we're deliberately committing fraud by verifying our deployment with one model and then serving customer requests with another".
I suspect there are a lot of semi-malicious actors who are only too happy to do the former.
Edit: Kimi K2 uses int4 during its training as well as inference [2]. I wonder whether quality suffers when different GGUF creators don't convert these weights correctly? (A quick way to check a conversion yourself is sketched after the links below.)
[1] https://openrouter.ai/docs/guides/routing/model-variants/exa...
[2] https://www.reddit.com/r/LocalLLaMA/comments/1pzfuqg/why_kim...
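One way to check what a given GGUF conversion actually contains is to count its tensors by quantization type. A minimal sketch, assuming the gguf Python package from the llama.cpp repo; the file path is a placeholder:

    # Count tensors by quantization type in a GGUF file, to spot
    # conversions that didn't land in the format you expected.
    from collections import Counter

    from gguf import GGUFReader  # pip install gguf

    reader = GGUFReader("kimi-k2.gguf")  # placeholder path
    counts = Counter(t.tensor_type.name for t in reader.tensors)
    for qtype, n in counts.most_common():
        print(f"{qtype:>8}: {n} tensors")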
This is probably Kimi trying to protect their brand from bargain-basement providers that don't properly represent what the models are capable of.
Going to test it out, thanks!
I'm curious what exactly they mean by this...
"because we learned the hard way that open-sourcing a model is only half the battle."
Alongside the release of the Kimi K2.6 model, we are open-sourcing the Kimi Vendor Verifier (KVV) project, designed to help users of open-source models verify the accuracy of their inference implementations.
Not as an afterthought, but because we learned the hard way that open-sourcing a model is only half the battle. The other half is ensuring it runs correctly everywhere else.
The K2VV evaluation results for the official Kimi API, used for calculating the F1 score, are available here.
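One plausible reading of that F1 computation (an assumption here, not confirmed by the KVV docs): treat the official API's per-request tool-call decisions as ground truth and a vendor's as predictions, then score the agreement:

    # Hypothetical F1 over per-request tool-call decisions: the
    # official API's behavior is ground truth, a vendor's is the
    # prediction being scored.
    def f1_score(official: list[bool], vendor: list[bool]) -> float:
        tp = sum(o and v for o, v in zip(official, vendor))
        fp = sum(v and not o for o, v in zip(official, vendor))
        fn = sum(o and not v for o, v in zip(official, vendor))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # Official fires a tool call on 4 of 5 requests, the vendor on 3:
    print(f1_score([True, True, True, True, False],
                   [True, True, False, True, False]))  # ~0.857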
From Isolated Incidents to Systemic Issues
Since the release of K2 Thinking, we have received frequent feedback from the community regarding anomalies in benchmark scores. Our investigation confirmed that a significant portion of these cases stemmed from the misuse of decoding parameters. To mitigate this immediately, we built our first line of defense at the API level: enforcing Temperature=1.0 and TopP=0.95 in Thinking mode, with mandatory validation that thinking content is correctly passed back.
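For illustration, a server-side guard along those lines might look like the following. This is a minimal sketch, not Moonshot's actual implementation, and the reasoning_content field name is an assumption:

    # Hypothetical guard: pin decoding parameters in Thinking mode and
    # reject multi-turn requests that drop earlier thinking content.
    THINKING_TEMPERATURE = 1.0
    THINKING_TOP_P = 0.95

    def enforce_thinking_defaults(request: dict) -> dict:
        # Override whatever the client sent for these parameters.
        request["temperature"] = THINKING_TEMPERATURE
        request["top_p"] = THINKING_TOP_P
        # Every earlier assistant turn must still carry its reasoning,
        # so the model sees its own thinking when the turn is replayed.
        for msg in request.get("messages", []):
            if msg.get("role") == "assistant" and not msg.get("reasoning_content"):
                raise ValueError("assistant turn is missing thinking content")
        return request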
However, more subtle anomalies soon triggered our alarm. In a specific evaluation on LiveBenchmark, we observed a stark contrast between third-party APIs and the official API. After extensive testing across infrastructure providers, we found the difference to be widespread.
This exposed a deeper problem in the open-source model ecosystem: The more open the weights are, and the more diverse the deployment channels become, the less controllable the quality becomes.
If users cannot distinguish between "model capability defects" and "engineering implementation deviations," trust in the open-source ecosystem will inevitably collapse.
Six Critical Benchmarks (selected to expose specific infra failures):
Upstream Fix: We embed with the vLLM/SGLang/KTransformers communities to fix root causes, not just detect symptoms.
Pre-Release Validation: Rather than waiting for post-deployment complaints, we provide early access to test models. This lets infrastructure providers validate their stacks before users encounter issues.
Continuous Benchmarking: We will maintain a public leaderboard of vendor results. This transparency encourages vendors to prioritize accuracy.
We completed full evaluation workflow validation on two NVIDIA H20 8-GPU servers, with sequential execution taking approximately 15 hours. To improve evaluation efficiency, the scripts have been optimized for long-running inference scenarios, including streaming inference, automatic retry, and checkpoint-resumption mechanisms.
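The shape of that hardening is straightforward to sketch. The following is a hypothetical structure, not the actual KVV scripts, and stream_completion is an assumed callable that yields text chunks from a vendor API:

    # Hypothetical long-run harness: resume from a checkpoint file of
    # completed request IDs and retry each streamed call with backoff.
    import json
    import time
    from pathlib import Path

    CHECKPOINT = Path("done.jsonl")

    def load_done() -> set[str]:
        if not CHECKPOINT.exists():
            return set()
        return {json.loads(line)["id"] for line in CHECKPOINT.open()}

    def run_all(requests, stream_completion, max_retries=3):
        done = load_done()
        with CHECKPOINT.open("a") as ckpt:
            for req in requests:
                if req["id"] in done:
                    continue  # finished in a previous run
                for attempt in range(max_retries):
                    try:
                        # Consume the stream chunk by chunk.
                        text = "".join(stream_completion(req))
                        ckpt.write(json.dumps({"id": req["id"], "text": text}) + "\n")
                        ckpt.flush()  # persist progress immediately
                        break
                    except Exception:
                        time.sleep(2 ** attempt)  # exponential backoff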
Weights are open. The knowledge to run them correctly must be too.
We are expanding vendor coverage and seeking lighter agentic tests. Contact Us: [email protected]