For example, Sketchy Provider tells you they are running the latest and greatest, but is actually, knowingly, running some cheaper (and worse) model and pocketing the difference. These tests wouldn't help, since Sketchy Provider could detect when they're being tested and do the right thing only then (like the Volkswagen emissions scandal). Right?
Kimi K2.6, however, is the new open-source leader so far. Agentic evaluations are still in progress, but one-shot coding/reasoning benchmarks are ready at https://gertlabs.com/?mode=oneshot_coding
If someone actually goes out of their way to bypass the check, that's a pretty different situation legally compared to just quietly shipping a cheaper quant anyway.
For a truly malicious actor, you're right. But it shifts it from "well we aren't obviously committing fraud by quantizing this model and not telling people" to "we're deliberately committing fraud by verifying our deployment with one model and then serving customer requests with another".
I suspect there are a lot of semi-malicious actors who are only too happy to do the former.
Edit: Kimi K2 uses int4 during its training as well as inference [2]. I wonder whether quality suffers when different GGUF creators don't convert these weights correctly? (A quick way to check a conversion yourself is sketched after the links below.)
[1] https://openrouter.ai/docs/guides/routing/model-variants/exa...
[2] https://www.reddit.com/r/LocalLLaMA/comments/1pzfuqg/why_kim...
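One way to check what a given GGUF conversion actually contains is to count its tensors by quantization type. A minimal sketch, assuming the gguf Python package from the llama.cpp repo; the file path is a placeholder:

    # Count tensors by quantization type in a GGUF file, to spot
    # conversions that didn't land in the format you expected.
    from collections import Counter

    from gguf import GGUFReader  # pip install gguf

    reader = GGUFReader("kimi-k2.gguf")  # placeholder path
    counts = Counter(t.tensor_type.name for t in reader.tensors)
    for qtype, n in counts.most_common():
        print(f"{qtype:>8}: {n} tensors")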
This is probably Kimi trying to protect their brand from bargain-basement providers that don't properly represent what the models are capable of.
Going to test it out, thanks!
I'm curious what exactly they mean by this...
"because we learned the hard way that open-sourcing a model is only half the battle."
Alongside the release of the Kimi K2.6 model, we are open-sourcing the Kimi Vendor Verifier (KVV) project, designed to help users of open-source models verify the accuracy of their inference implementations.
Not as an afterthought, but because we learned the hard way that open-sourcing a model is only half the battle. The other half is ensuring it runs correctly everywhere else.
The K2VV evaluation results for the official Kimi API, used for calculating the F1 score, are available here.
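One plausible reading of that F1 computation (an assumption here, not confirmed by the KVV docs): treat the official API's per-request tool-call decisions as ground truth and a vendor's as predictions, then score the agreement:

    # Hypothetical F1 over per-request tool-call decisions: the
    # official API's behavior is ground truth, a vendor's is the
    # prediction being scored.
    def f1_score(official: list[bool], vendor: list[bool]) -> float:
        tp = sum(o and v for o, v in zip(official, vendor))
        fp = sum(v and not o for o, v in zip(official, vendor))
        fn = sum(o and not v for o, v in zip(official, vendor))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # Official fires a tool call on 4 of 5 requests, the vendor on 3:
    print(f1_score([True, True, True, True, False],
                   [True, True, False, True, False]))  # ~0.857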
From Isolated Incidents to Systemic Issues
Since the release of K2 Thinking, we have received frequent feedback from the community regarding anomalies in benchmark scores. Our investigation confirmed that a significant portion of these cases stemmed from the misuse of decoding parameters. To mitigate this immediately, we built our first line of defense at the API level: enforcing Temperature=1.0 and TopP=0.95 in Thinking mode, with mandatory validation that thinking content is correctly passed back.
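For illustration, a server-side guard along those lines might look like the following. This is a minimal sketch, not Moonshot's actual implementation, and the reasoning_content field name is an assumption:

    # Hypothetical guard: pin decoding parameters in Thinking mode and
    # reject multi-turn requests that drop earlier thinking content.
    THINKING_TEMPERATURE = 1.0
    THINKING_TOP_P = 0.95

    def enforce_thinking_defaults(request: dict) -> dict:
        # Override whatever the client sent for these parameters.
        request["temperature"] = THINKING_TEMPERATURE
        request["top_p"] = THINKING_TOP_P
        # Every earlier assistant turn must still carry its reasoning,
        # so the model sees its own thinking when the turn is replayed.
        for msg in request.get("messages", []):
            if msg.get("role") == "assistant" and not msg.get("reasoning_content"):
                raise ValueError("assistant turn is missing thinking content")
        return request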
However, more subtle anomalies soon triggered our alarm. In a specific evaluation on LiveBenchmark, we observed a stark contrast between third-party APIs and the official API. After extensive testing across infrastructure providers, we found the difference to be widespread.
This exposed a deeper problem in the open-source model ecosystem: The more open the weights are, and the more diverse the deployment channels become, the less controllable the quality becomes.
If users cannot distinguish between "model capability defects" and "engineering implementation deviations," trust in the open-source ecosystem will inevitably collapse.
Six Critical Benchmarks (selected to expose specific infra failures):
Upstream Fix: We embed with the vLLM/SGLang/KTransformers communities to fix root causes, not just detect symptoms.
Pre-Release Validation: Rather than waiting for post-deployment complaints, we provide early access to test models. This lets infrastructure providers validate their stacks before users encounter issues.
Continuous Benchmarking: We will maintain a public leaderboard of vendor results. This transparency encourages vendors to prioritize accuracy.
We completed full evaluation workflow validation on two NVIDIA H20 8-GPU servers, with sequential execution taking approximately 15 hours. To improve evaluation efficiency, the scripts have been optimized for long-running inference scenarios, including streaming inference, automatic retry, and checkpoint-resumption mechanisms.
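The shape of that hardening is straightforward to sketch. The following is a hypothetical structure, not the actual KVV scripts, and stream_completion is an assumed callable that yields text chunks from a vendor API:

    # Hypothetical long-run harness: resume from a checkpoint file of
    # completed request IDs and retry each streamed call with backoff.
    import json
    import time
    from pathlib import Path

    CHECKPOINT = Path("done.jsonl")

    def load_done() -> set[str]:
        if not CHECKPOINT.exists():
            return set()
        return {json.loads(line)["id"] for line in CHECKPOINT.open()}

    def run_all(requests, stream_completion, max_retries=3):
        done = load_done()
        with CHECKPOINT.open("a") as ckpt:
            for req in requests:
                if req["id"] in done:
                    continue  # finished in a previous run
                for attempt in range(max_retries):
                    try:
                        # Consume the stream chunk by chunk.
                        text = "".join(stream_completion(req))
                        ckpt.write(json.dumps({"id": req["id"], "text": text}) + "\n")
                        ckpt.flush()  # persist progress immediately
                        break
                    except Exception:
                        time.sleep(2 ** attempt)  # exponential backoff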
Weights are open. The knowledge to run them correctly must be too.
We are expanding vendor coverage and seeking lighter agentic tests. Contact Us: [email protected]