This seems to be somewhat unwise. Such an insertion would qualify as an anomaly. And if it's also trained that way, would you not train the model to find artificial frames where they don't belong?
Would it not have been better to find a set of videos where something specific (common, rare, surprising, etc) happens at some time and ask the model about that?
This could describe adding a frame of nonsense into an existing video.
It also could describe finding a semantically useful thing in an actual video, where the exact location is randomised by looking at different time crops of the video. For example, finding a book on a desk in a video that's only there in a panning shot, and you then see if it can find it in a 10s cut, 20s cut, 10 minute cut, etc, and near the start/middle/end.
Here's the paper: https://arxiv.org/pdf/2511.21631
> To evaluate the model’s capability in processing long-context inputs, we construct a video “Needle-in-a-Haystack” evaluation on Qwen3-VL-235B-A22B-Instruct. In this task, a semantically salient “needle” frame—containing critical visual evidence—is inserted at varying temporal positions within a long video. The model is then tasked with accurately locating the target frame from the long video and answering the corresponding question. During evaluation, videos are uniformly sampled at 1 FPS, and frame resolution is dynamically adjusted to maintain a constant visual token budget.
This potentially sounds more like the former, but I can't find more accurate information on how this works.
Regardless, I'd say again that while they're not the whole story, things like this really are useful to know and can be very important to test - it's really not a given that models can always find anything in their context window, perhaps even more so for video.
I think Gemini analyzes the transcription.
Can I do the same for free with Qwen3?
The github spells it out much better: https://github.com/QwenLM/Qwen3-VL?tab=readme-ov-file#cookbo...
I like the Qwen models and use them for other tasks successfully. It is so interesting how LLMs will do quite well in one situation and quite badly in another.
link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52
Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo
Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction-tuned LLM, usually a small one because students have limited compute) with OCR tokens bests just about every OCR network out there.
And it's not just OCR. Describing images. Bounding boxes. Audio, both ASR and TTS, all works better that way. Now many research papers are only really about how to encode image/audio/video to feed it into a Llama or Qwen model.
Disclaimer: co-founder
We use an embedding model that processes videos and allows you to perform RAG on them.
This is why keeping our governments from eating that tasty apple of "if you can record AND analyse everything there will be so much less crime" and "just give us keys to all private communication, we swear we will just use it to find bad guys" matters. Because someone will abuse that access, and someone will use it to hit on people they don't like.
https://www.theguardian.com/cities/2018/mar/01/smart-cities-...
Now people will say again that this project has been abandoned, which just isn't true (2024):
https://www.dutchnews.nl/2024/06/smart-street-surveillance-o...
Not to mention cloud platforms that collect evidence and process it with all the models and store that information for searching…
Qwen is a video model trained by a Communist government, or technically by a company with very close ties to the Chinese government. The Chinese government also has laws requiring AI be used to further the political goals of China in particular and authoritarian socialism in general.
In the light of all this, I think it's reasonable to conclude that this technology will be used for Big Brother type surveillance and quite possible that it was created explicitly for that purpose.
Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:
```json
[
  {"bbox_2d": [217, 112, 920, 956], "label": "cat"}
]
```
Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:
[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]
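A minimal helper for that conversion (the coordinate order and the 0–1000 scale are from the description above; the function name is my own):

```python
def denormalize_bbox(bbox, image_width, image_height):
    """Convert a normalized [x_min, y_min, x_max, y_max] bounding box
    (0-1000 range) into pixel coordinates for the given image size."""
    x_min, y_min, x_max, y_max = bbox
    return [
        x_min / 1000 * image_width,
        y_min / 1000 * image_height,
        x_max / 1000 * image_width,
        y_max / 1000 * image_height,
    ]

# e.g. denormalize_bbox([217, 112, 920, 956], 1920, 1080)
# gives roughly [416.6, 121.0, 1766.4, 1032.5] in pixels
```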
Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595
https://huggingface.co/allenai/Molmo-7B-D-0924
I’m sure they have some cool secret stuff, but they are perhaps not 10 years ahead. Also, I find it unlikely that those secrets wouldn’t make it out to the public now, as we are probably close to the top of the AI bubble.
- Using pyscenedetect to split each video on a per scene level
- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a particular sample rate (specific rate I don't have handy right now, but it was 1-2 per scene)
- Aggregating frames in batches of around 256 frames to be normalized for CLIP embedding on GPU (had to re-write the normalization process for this because the default library does it on CPU)
- Uploading the frames along with metadata (timestamp, etc) into a vector DB, in my case Qdrant running locally along with a screenclip of the frame itself for debugging.
I'm bottlenecked by GPU compute so I also started experimenting with using Modal for the embedding work too, but then vacation ended :) Might pick it up again in a few weeks. I'd like to be able to have a temporal-aware and potentially enriched search so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and be able to get a timestamped clip from the movie.
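A rough sketch of the steps above (the library names are the ones mentioned in the list; the CLIP checkpoint, the `clip_embed` helper, and the collection name are my own assumptions, and the heavy imports are deferred so the pure batching helper stands alone):

```python
import uuid

def batch(items, size=256):
    """Group items into fixed-size batches for GPU embedding."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def index_video(video_path, qdrant_client, clip_embed,
                collection="movie_frames", frames_per_scene=2):
    """Split a video into scenes, sample frames per scene, embed them,
    and upsert the vectors plus timestamps into Qdrant.
    `clip_embed` is a hypothetical frame -> vector function (e.g. a
    GPU-normalized CLIP image encoder)."""
    # Deferred imports: these need the video/GPU stack installed.
    from scenedetect import detect, ContentDetector
    from decord import VideoReader
    from qdrant_client.models import PointStruct

    scenes = detect(video_path, ContentDetector())  # [(start, end) timecodes]
    vr = VideoReader(video_path)
    fps = vr.get_avg_fps()

    points = []
    for start, end in scenes:
        s, e = start.get_frames(), end.get_frames()
        step = max(1, (e - s) // frames_per_scene)
        for idx in range(s, e, step):
            frame = vr[idx].asnumpy()  # HWC uint8 frame
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=clip_embed(frame),
                payload={"timestamp": idx / fps, "scene": (s, e)},
            ))
    for chunk in batch(points):
        qdrant_client.upsert(collection_name=collection, points=chunk)
```

The payload timestamp is what would let a query like "the scene where Rami Malek testifies" come back as a seekable offset rather than just a matching frame.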
Orwell was a democratic socialist. He was opposed to totalitarian politics, not communism per se.
I can't show any evidence as I don't have such tests, but it's like coding normally vs coding after a beer or two.
For the massive effect, fill it 95% and we're talking vodka shots. 99%? A zombie who can code. But perhaps that's not fair when you have 1M token context size.
There has been some research specifically in this area with what appears to be classic ML models [2], but it's unclear to me if it can generalize to dances it has not been trained on.
Back in 2009 I was working at a place where O2 was a client, and they gave us an API that could identify the cell tower (inc. lat/lng) any of their customers were connected to. The network needs to track this data internally to function, so the API is basically the equivalent of their DNS.
A lot of my side projects involve UIs, and almost all of my problems with getting LLMs to write them for me involve "the UI isn't doing what you say it's doing": A) finding a reliable way to get it to look at the UI so it can continue its loop, and B) getting it to understand what it's looking at well enough to do something about it.
My take is it fits into the general concept that generalist models have significant advantages because so much more latent structure maps across domains than we expect. People still talk about fine tuning dedicated models being effective but my personal experience is it's still always better to use a larger generalist model than a smaller fine tuned one.
prompt: I attach a screenshot (1920x1080). Write code to click the submit button using pyautogui.
attachment: <screenshot>
reply:
```python
import pyautogui
pyautogui.click(100, 200)
```

Doesn't that pretty much cover Palantir as well?
> [Nineteen Eighty-Four] was based chiefly on communism, because that is the dominant form of totalitarianism, but I was trying chiefly to imagine what communism would be like if it were firmly rooted in the English speaking countries, and was no longer a mere extension of the Russian Foreign Office.
And of course Animal Farm is only about communism (as opposed to communism + fascism). And the lesser known Homage to Catalonia depicts the communist suppression of other socialist groups.
By all this I just mean to say when you're reading Nineteen Eighty-Four what he's describing is barely a fictionalization of what was already going on in the Soviet Union. There's just not a lot in the book that is specifically Nazi or Fascist.
I don't have any opinion on whether he thought there were non-totalitarian forms of communism.
>it's still always better to use a larger generalist model than a smaller fine tuned one
Smaller fine-tuned models are still a good fit if they need to run on-premises cheaply and are already good enough. Isn't that their main use case?
A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model. The data shows the system excels at image-based math tasks and can analyze hours of video footage.
The system handles massive data loads, processing two-hour videos or hundreds of document pages within a 256,000-token context window.
In "needle-in-a-haystack" tests, the flagship 235-billion-parameter model located individual frames in 30-minute videos with 100 percent accuracy. Even in two-hour videos containing roughly one million tokens, accuracy held at 99.5 percent. The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.
The needle-in-a-haystack test measures the model's ability to locate specific frames in long videos. | Image: Alibaba
In published benchmarks, the Qwen3-VL-235B-A22B model often beats Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1 - even when competitors use reasoning features or high thinking budgets. The model dominates visual math tasks, scoring 85.8 percent on MathVista compared to GPT-5's 81.3 percent. On MathVision, it leads with 74.6 percent, ahead of Gemini 2.5 Pro (73.3 percent) and GPT-5 (65.8 percent).
Gemini's older 2.5 Pro model maintains a slight lead in general image understanding. | Image: Alibaba
The model also shows range in specialized benchmarks. It scored 96.5 percent on the DocVQA document comprehension test and 875 points on OCRBench, supporting 39 languages - nearly four times as many as its predecessor.
Qwen3-VL achieves over 70 percent accuracy on OCR tasks in 32 of the 39 supported languages. | Image: Alibaba
Alibaba claims the system demonstrates new capabilities in GUI agent tasks. It achieved 61.8 percent accuracy on ScreenSpot Pro, which tests navigation in graphical user interfaces. On AndroidWorld, where the system must independently operate Android apps, Qwen3-VL-32B hit 63.7 percent.
The model handles complex, multi-page PDF documents as well. It scored 56.2 percent on MMLongBench-Doc for long document analysis. On the CharXiv benchmark for scientific charts, it reached 90.5 percent on description tasks and 66.2 percent on complex reasoning questions.
It is not a clean sweep, however. In the complex MMMU-Pro test, Qwen3-VL scored 69.3 percent, trailing GPT-5's 78.4 percent. Commercial competitors also generally lead in video QA benchmarks. The data suggests Qwen3-VL is a specialist in visual math and documents, but still lags in general reasoning.
The technical report outlines three main architectural upgrades. First, "interleaved MRoPE" replaces the previous position embedding method. Instead of allocating contiguous blocks of embedding dimensions to time, horizontal, and vertical position, the new approach interleaves the three across the full range of dimensions. This change aims to boost performance on long videos.
Qwen3-VL combines a vision encoder and language model to process text, images, and videos simultaneously. DeepStack uses visual information from different processing levels. | Image: Alibaba
Second, DeepStack technology allows the model to access intermediate results from the vision encoder, not just the final output. This gives the system access to visual information at different levels of detail.
Third, a text-based timestamp system replaces the complex T-RoPE method found in Qwen2.5-VL. Instead of assigning a mathematical time position to every video frame, the system now inserts simple text markers like "<3.8 seconds>" directly into the input. This simplifies the process and improves the model's grasp of time-based video tasks.
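A rough sketch of what that text-based scheme might look like when building the input sequence (the "<3.8 seconds>" marker format is from the report; the function and the string stand-ins for visual tokens are my own illustration):

```python
def interleave_timestamps(frame_tokens, fps=1.0):
    """Interleave a text timestamp marker before each frame's visual
    tokens, instead of assigning every frame a positional time index.
    `frame_tokens` is a list of per-frame token lists."""
    seq = []
    for i, tokens in enumerate(frame_tokens):
        seq.append(f"<{i / fps:.1f} seconds>")
        seq.extend(tokens)
    return seq
```

Because the timestamps are ordinary text, the model can quote them directly when answering "when does X happen" style questions.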
Alibaba trained the model in four phases on up to 10,000 GPUs. After learning to link images and text, the system underwent full multimodal training on about one trillion tokens. Data sources included web scrapes, 3 million PDFs from Common Crawl, and over 60 million STEM tasks.
In later phases, the team gradually expanded the context window from 8,000 to 32,000 and finally to 262,000 tokens. The "Thinking" variants received specific chain-of-thought training, allowing them to explicitly map out reasoning steps for better results on complex problems.
All Qwen3-VL models released since September are available under the Apache 2.0 license with open weights on Hugging Face. The lineup includes dense variants ranging from 2B to 32B parameters, as well as mixture-of-experts models: the 30B-A3B and the massive 235B-A22B.
While features like extracting frames from long videos aren't new - Google's Gemini 1.5 Pro handled this in early 2024 - Qwen3-VL offers competitive performance in an open package. With the previous Qwen2.5-VL already common in research, the new model is likely to drive further open-source development.