Why these specific models / versions?
I'm happy they added an option for ONNXRuntime. I wish their cv.dnn was mostly that unified wrapper around many different backends (ONNXRuntime, Executorch, LiteRT, CoreAI) and maybe just some tooling around it (performance metrics tools, model downloads etc). The Transformers(.js) approach looks better to me.
Wish they also invested more time into better production ready Camera I/O (for mobiles, device/format discovery, manual settings, depthmap support, etc) and better Highgui that could use different backends (skia, webgpu) and on mobiles.
Am I the only one that finds this sentence very cheesy?
OpenCV 4.11: ~255 ms, OpenCV 5.0.0: ~185 ms
with the same code.
So there's room for even better performance!
I'm not interested in understanding papers or the math behind it, but rather in how to put a system into production, whether it's object detection, running 20 cameras in parallel on a single computer, like sizing hardware for a specific task, and so on.
Any tips?
But not for saving video. That fourcc pile of crap doesn't open up in QuickTime player, the default Ubuntu video player, or anything anybody actually uses. I've always had to add an os.system("ffmpeg [ask llm to generate the command for you]") afterwards to fix anything that OpenCV generates.
Large general models have taken over in NLP, and (outside of embedded/low latency applications) it seems like they are coming for CV next.
So you should soon be able to have a large generic model that can detect whatever for you.
It's already pretty much possible with open-vocabulary detectors like SAM3, where you could just prompt it with "Apple": https://ai.meta.com/research/sam3/
Start with something very simple. Then do a slightly more ambitious project.
It also heavily depends on what you already know regarding programming, image processing etc.
Speaking from experience: never used OpenCV before, recently vibe coded a tool that makes supercuts of pool videos, trimming each clip from the cue ball's first strike to when the motion stops.
However, it has a few issues:
1. Patented algorithms that are effectively impossible to license in a commercial setting.
2. Permuted APIs that change how identically named functions behave over versions.
3. Hardware CUDA version coupling deprecating support every major release.
4. Inconsistent and contradictory documentation in the constant subtle permutations. Downstream projects tend to version lock the lib for really practical reasons.
5. A shift away from core C libraries like ImageMagick & V4l, and into C++ abstractions with legacy Swig wrapper libraries in Java or Python.
6. Perpetual-Beta culture means the library is unlikely to ever fully stabilize.
It is a fun library, until people actually try to deploy something serious. As users will often simply suggest using an old version release if there is a bug.
Everything from Build flags to the API documentation has never fully stabilized. ymmv =3
If you need something less restricted to existing labels (say wanting all the red apples, or all cardboard signs) SAM3 is great, as the sibling comment says
Sure, running models on the CPU is very much a thing in computer vision (the benchmarked YOLOv8n has about 3.2M params). But this whole announcement feels more like OpenCV catching up to the modern world, not "The Biggest Leap in Years for Computer Vision".
Still great, needing fewer libraries is a good thing, but maybe a bit oversold
- OpenCV is Apache license. Yes, it used to be more complicated.
- The only patented algorithm I am aware of, SIFT, used to be part of opencv_contrib. And the README in opencv_contrib would greet you with a warning that the code may not be fit for commercial use for various reasons. Only when the patent expired was it moved into OpenCV core.
- Same observation for Aruco marker detection, which was in contrib for a long time because the options to choose from were either not-well-maintained or GPL-licensed code. It is now in core OpenCV (and Apache).
- Despite its age, I think that OpenCV is still more than relevant today. And being part of modern languages like C++, Swig, Java and Python (and for years already) is part of that. Still I was surprised how long they maintained OpenCV 2 and 3.
- Over the past releases and the last few years, my impression was actually that the core API was very much stable(izing). Can't say what happened in contrib, or what it feels like when you treat core and contrib as one and a feature progresses from contrib to core.
- I do agree that I would usually check that a MINOR release wasn't actually a MAJOR release, breaking some API or behavior I was relying on. I am hoping that Version 5 is pulling the ambitions for doing things differently away from Version 4, so v4 can be used stably ;-)
But I can’t really complain because it’s open source and added to by contributors.
A quick note to say that this is also a task you can hand to things like gemini.
https://docs.rs/onnxruntime/latest/onnxruntime/
It’s a Rust wrapper around ONNX Runtime. We currently serve 5+ million inference requests per day for a highly performance-sensitive application, for a long list of major enterprise clients. We don’t use GPUs for inference, because it would be cost-prohibitive. We launch tens of thousands of VMs per day to run these workloads.
Dude, in business we think in terms of large numbers; internationally that is easily billions of images processed. This wouldn't cut it.
Also, do you buy the mega expensive super individually designed shoes from the best shoemaker there is to march along through some dirt or simply stick to gumboots?
OpenCV is used behind the scenes for much of the fancy stuff those major AI providers pretend to do. Claude is a huge system and not just an LLM anymore.
Like, the AI model tools already exist; all that would be accomplished if OpenCV pivoted would be to take it away from people who want to do low-level vision programming. It wouldn't add anything useful to the world, just destroy an excellent library.
It's a lot better, faster, cheaper to use LLMs for initial labeling together with hand finetuning and then training YOLO with this.
Training YOLO takes a few hours and is then very fast.
I might be on board about LLMs being the future of OCR (though many would disagree), but for general CV they are very inefficient for very limited benefit
We're not going to fit Nano Banana or anything like it on a device with 512MB RAM and a GPU old enough to be irrelevant, and again, API calls just aren't on the menu.
some SBC w/ an industrial camera that is doing pick-place or go/no-go operations on a conveyor belt against a singular object type doesn't need a huge image-gen/llm model governing it.
I mean have you even considered the kind of performance an opencv function can get w/ just mask-matching? I mean even with a fancy YOLO model these answers get thrown out in 1.5-50ms ; this is just a wholly different time scaling.
To enable Intel TBB, CUDA, and CPU specific compiler optimizations... one will almost certainly need to re-build the library, and customize your application build.
Some tasks degrade in performance on a GPU, and others are 740 times faster... ymmv. =3
Indeed, if your library dependency constellation works, some will static link to stabilize/freeze their project for more than a few months.
It wasn't that v3 was particularly good, but rather v4 was a mess. I predict v5 inherited that mess, and improved it... lol =3
Is the image(text) function reversible? Or are they brute force searching a nearest neighbor like word2vec/hash brute forcing.
Where is the human creativity in writing release notes gone?
If a human can't be bothered to write a piece, I can't be bothered to read it.
Due to how simple they are to work with they will become popular. Compare NLP before and after GPT-3. GPT-3 majorly brought down the complexity and skill needed for doing NLP tasks even if traditional NLP is much much faster. Ultimately ease of development will win out and the industry will work towards optimizing running such LLMs to make it cheap enough to run.
Also if they are better then you can also have a flow that’s cheap model -> marginal cases go to more complex thing (and a chain of these).
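To make the "cheap model first, harder cases escalate" flow concrete, here is a minimal sketch. The two model functions, their labels, and the 0.8 threshold are all hypothetical stand-ins, not real detectors:

```python
def cascade(item, models, threshold=0.8):
    """Try models from cheapest to most expensive; stop at the first confident answer.

    models: list of callables returning (label, confidence). All hypothetical.
    """
    label, confidence = None, 0.0
    for model in models:
        label, confidence = model(item)
        if confidence >= threshold:
            break  # confident enough, no need to pay for the next model
    return label, confidence


def cheap_model(x):
    # Stand-in for a small, fast detector that is unsure on hard inputs.
    return ("apple", 0.95) if x == "clear photo" else ("apple", 0.4)


def big_model(x):
    # Stand-in for a large, slow model that is only called on marginal cases.
    return ("pear", 0.9)


easy = cascade("clear photo", [cheap_model, big_model])
hard = cascade("blurry photo", [cheap_model, big_model])
```

Only the blurry input reaches the expensive model, which is where the cost saving would come from.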
The yolo models are really shockingly good for their cost and how well they can work with not much training data as well.
Even if they were an option, your 300ms latency requirement would exclude them anyway.
As for being well-written, does that refer to correct use of grammar and no typos, or do you mean that you find that bots write better than humans in any other way?
If someone slapped together an article from an LLM and a few internal documents, that tells me exactly how much they cared about it.
I personally don't mind AI generated content when it's properly reviewed, but unfortunately more often than not the author just glances at the result and decides it's good enough.
Example: https://opencv.org/wp-content/uploads/2026/06/image-1.jpeg
I'm not knowledgeable enough to determine whether this diagram is 100% accurate, but some things look off - the arrows in the bottom left seem superficial, some arrows are connected in weird ways, the mini diagram in AttentionLayer block doesn't look right (it has two Softmax icons and one MatMul icon, while the "before" diagram is the opposite).

OpenCV 5 is one of the most important releases in the history of OpenCV.
For more than two decades, OpenCV has been the foundation for computer vision research, robotics, embedded vision, AI applications, industrial inspection, AR/VR, medical imaging, and countless production systems. Today, the library has more than 86,000 GitHub stars, more than a million installs per day, and one of the largest collections of computer vision algorithms in the world.
OpenCV 5 builds on that foundation with a major modernization of the library. It brings a new DNN engine, stronger ONNX support, hardware acceleration improvements, better Python integration, new data types, expanded 3D vision capabilities, improved documentation, and a cleaner architecture for the future.
This is not just another incremental release. OpenCV 5 is a major step forward.
Computer vision has changed dramatically since OpenCV 4.
Modern applications now combine classical vision, deep learning, transformers, large vision models, edge deployment, heterogeneous hardware, and Python-first workflows. Developers expect the same code to run efficiently across laptops, servers, embedded devices, ARM chips, Snapdragon platforms, and specialized accelerators.
OpenCV 5 was designed to meet that reality.
The goals were clear: make the core faster and smaller, improve language support, clean up old APIs, modernize the DNN engine, support new hardware acceleration paths, improve 3D vision tooling, and make the documentation easier to use.
If you have shipped anything with OpenCV in the last few years, you know the feeling. The library does almost everything, but the deep learning side always felt a step behind the models people were really using. You would export a new model to ONNX, point OpenCV’s DNN module at it, and cross your fingers. Sometimes it worked. Sometimes it threw an error about an operator it had never heard of.
In this post we will walk through what is new, why it matters in practice, and what it changes for the code you write. You do not need to know the library’s internals. If you have ever written cv2.imread, you are in the right place.
The pip version of OpenCV 5 will be released on 8th June.
Before we get into what changed, it helps to remember how widely used OpenCV is. This is not a niche research tool. It is plumbing for a huge slice of the computer vision world.

(Sources: github.com/opencv/opencv, pypistats.org, embedded-vision.com.)
When a library is this deeply embedded in production systems, every change has to be made carefully. That is part of why a major version takes time, and why it is a big deal when one finally arrives.
It also helps to know who builds it. OpenCV is stewarded by the non-profit OpenCV.org, with development and support coming from Big Vision (which supports the library, OpenCV University, and content like this blog), OpenCV China (a major force behind RISC-V and embedded work) and OpenCV.ai.
The team started OpenCV 5 with a clear list of pain points. If you have used OpenCV for a while, you will recognize most of them:
The rest of this post is that list, made real. We will start with the change that affects the most people.
The single most important number in this release is coverage. OpenCV’s ONNX operator support jumped from roughly 22% in the 4.x days to over 80% in OpenCV 5.
If you have ever fought with OpenCV refusing to load a modern model, that number is the fix. The reason behind it is more interesting than the number itself.
The old 4.x engine imported a small fraction of the ONNX operator set and struggled with anything that had dynamic shapes, which covers most interesting models these days. The 5.x engine was rebuilt around a typed operation graph with proper shape inference, constant folding, and operator fusion. Instead of treating a network as a flat list of layers and walking them one by one, OpenCV 5 understands the model as a graph. That lets it reason about the network, simplify it, and run it far more efficiently.

ONNX operator coverage, then and now.
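To give a feel for what "understanding the model as a graph" buys, here is a toy constant-folding pass. It is a sketch only: the dictionary-based graph and the `fold_constants` helper are invented for illustration and do not reflect OpenCV's internal data structures.

```python
import operator

# Toy graph: name -> (op, payload). "const" nodes carry a value, other ops carry
# a list of input names. Illustrative only, not OpenCV's representation.
OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(graph):
    """Replace every node whose inputs are all constants with a single const node."""
    folded = {}
    for name, (op, payload) in graph.items():  # assumes nodes are in topological order
        if op in OPS and all(folded[a][0] == "const" for a in payload):
            value = OPS[op](*(folded[a][1] for a in payload))
            folded[name] = ("const", value)
        else:
            folded[name] = (op, payload)
    return folded


graph = {
    "x": ("input", None),
    "two": ("const", 2.0),
    "three": ("const", 3.0),
    "scale": ("mul", ["two", "three"]),  # depends only on constants
    "y": ("mul", ["x", "scale"]),        # depends on the runtime input
}
folded = fold_constants(graph)
```

After the pass, `scale` is a precomputed constant and only `y` is left to evaluate at inference time. A flat list of layers walked one by one has no natural place to do this kind of simplification.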
A few things the new engine handles that the old one could not:
That last point deserves a closer look. One of the headline optimizations is attention fusion. The engine recognizes the classic MatMul → Softmax → MatMul pattern at the heart of every transformer and collapses it into a single fused attention operation, backed by a FlashAttention-style implementation. You get this for free. Load your model, and it runs faster.
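The idea behind fusing MatMul → Softmax → MatMul can be sketched in plain Python for a single query vector. This is not OpenCV's implementation; it only illustrates the running-maximum ("online softmax") trick that FlashAttention-style kernels rely on, and it leaves out the usual scaling factor for brevity.

```python
import math


def attention_unfused(q, keys, values):
    """MatMul -> Softmax -> MatMul, materializing every intermediate."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]      # first MatMul
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]                    # Softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]  # second MatMul


def attention_fused(q, keys, values):
    """Same result in one pass, keeping only a running max, sum, and accumulator."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, score)
        rescale = math.exp(running_max - new_max)  # shrink earlier terms to the new max
        weight = math.exp(score - new_max)
        running_sum = running_sum * rescale + weight
        acc = [a * rescale + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
unfused = attention_unfused(q, keys, values)
fused = attention_fused(q, keys, values)
```

Both functions return the same vector up to floating-point rounding, but the fused version never stores the full score or probability lists, which is where the memory and speed savings come from on large sequences.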

| Aspect | Classic engine (4.x) | New engine (5.x) |
|---|---|---|
| Model representation | One struct per layer, walked in order | A typed graph the engine can analyze |
| Shapes | Static only | Symbolic, dynamic |
| Subgraphs | Not supported | If and Loop supported |
| Fusion | Limited | QDQ, BatchNorm, Attention, MatMul, Softmax, and more |
| Memory | Reused per layer | A unified buffer pool that reuses memory aggressively |
The practical result is straightforward. More models load, more models run correctly, and many of them run faster.
Rewrites make people nervous, and rightly so. Nobody wants a working pipeline to break on upgrade day. OpenCV 5 handles this by keeping more than one engine available behind the same DNN API. You choose which one loads your model right where you read it, through an engine argument on the readNet* family of functions. The values come from the cv::dnn::EngineType enum:
| Value | Meaning |
|---|---|
| ENGINE_CLASSIC (1) | Force the old 4.x-style engine. This is the path that supports non-CPU backends and targets such as CUDA and OpenVINO. |
| ENGINE_NEW (2) | Force the new graph engine, with fusion and dynamic shapes. It runs on CPU only for now. |
| ENGINE_AUTO (3) | The default. Try the new engine first, and fall back to the classic engine if the model fails to load. |
| ENGINE_ORT (4) | Use the bundled ONNX Runtime wrapper. ONNX models only, and the build must be configured with WITH_ONNXRUNTIME=ON. |
Because ENGINE_AUTO is the default, most code does not have to do anything special. You read the model, and OpenCV uses the new engine when it can and the old one when it cannot. When you want to pin a specific engine, you pass it at load time.
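```python
import cv2 as cv

# Default behaviour (ENGINE_AUTO).
net = cv.dnn.readNetFromONNX("model.onnx")

# Or pin a specific engine at load time.
# net = cv.dnn.readNetFromONNX("model.onnx", engine=cv.dnn.ENGINE_NEW)

# `blob` is your preprocessed input, prepared beforehand.
net.setInput(blob)
out = net.forward()
```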
```cpp
#include <opencv2/dnn.hpp>
using namespace cv;

// Default behaviour (ENGINE_AUTO).
dnn::Net net = dnn::readNetFromONNX("model.onnx");

// Or pin a specific engine at load time.
/* dnn::Net netNew = dnn::readNetFromONNX("model.onnx", dnn::ENGINE_NEW); */

net.setInput(blob);
Mat out = net.forward();
```
One practical detail is worth knowing. The new engine is CPU-only at the moment, so if you select a non-CPU backend and target (for example CUDA or OpenVINO through setPreferableBackend and setPreferableTarget), you will want the classic engine.
The OpenCV samples handle this for you by switching to ENGINE_CLASSIC when you pass a non-default backend or target on the command line.
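If you want the same rule in your own scripts, it can be captured in a few lines. The sketch below is hypothetical: `choose_engine` and the string names for backends and targets are invented here, and the enum simply mirrors the values from the table above rather than importing them from OpenCV.

```python
from enum import IntEnum


class EngineType(IntEnum):
    """Mirrors the values listed in the table above; not imported from OpenCV."""
    ENGINE_CLASSIC = 1
    ENGINE_NEW = 2
    ENGINE_AUTO = 3
    ENGINE_ORT = 4


def choose_engine(backend="default", target="cpu"):
    """Hypothetical helper: fall back to the classic engine for non-default backends or targets."""
    if backend != "default" or target != "cpu":
        return EngineType.ENGINE_CLASSIC
    return EngineType.ENGINE_AUTO


cpu_choice = choose_engine()
gpu_choice = choose_engine(backend="cuda", target="cuda")
```

The returned value is what you would pass as the `engine` argument when reading the model, assuming your build exposes the constants with these values.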
This design keeps upgrade-day risk low. The old engine is still there for anything the new one cannot load yet or cannot accelerate, and the optional ONNX Runtime path (when built in) widens coverage further, all through the same Net API.

Coverage is one thing, and speed is what people argue about. The team benchmarked the new engine head-to-head against ONNX Runtime on CPU across a range of real models. Here are the cases where OpenCV 5 comes out ahead:
| Model | OpenCV 5 DNN (ms) | ONNX Runtime (ms) | Difference |
|---|---|---|---|
| XFeat | 6.56 | 8.61 | 31.25% faster |
| YOLOv8n | 10.9 | 12.15 | 11.5% faster |
| YOLOX-S | 23.46 | 25.16 | 7.24% faster |
| DINOv2 small | 23.78 | 29.58 | 24.4% faster |
| RF-DETR | 102.01 | 106.49 | 4.4% faster |
| OWLv2 | 1,090 | 1,489 | 36.6% faster |
| BiRefNet | 7,178 | 9,503.14 | 32.4% faster |
Hardware: Intel Core i9-14900KS, Ubuntu 24.04 LTS. Lower latency is better. The difference is how much faster OpenCV 5 DNN is than ONNX Runtime on the same model and machine.
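The Difference column appears to be computed relative to the OpenCV 5 latency, since that formula reproduces every row of the table. A quick check on three of the rows, with the formula itself being an inference from the numbers rather than something the benchmark states:

```python
def percent_faster(opencv_ms, ort_ms):
    """Extra time ONNX Runtime takes, as a percentage of the OpenCV 5 latency."""
    return (ort_ms - opencv_ms) / opencv_ms * 100.0


rows = {
    "XFeat": (6.56, 8.61),
    "YOLOv8n": (10.9, 12.15),
    "OWLv2": (1090.0, 1489.0),
}
for model, (opencv_ms, ort_ms) in rows.items():
    print(f"{model}: {percent_faster(opencv_ms, ort_ms):.2f}% faster")
```

Note that this is not the same as the latency reduction relative to ONNX Runtime, which would be a smaller number (for XFeat, 2.05 ms saved out of 8.61 ms is roughly 24%).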
The pattern holds across the board. From tiny real-time detectors like YOLO26n to heavyweight open-vocabulary models like OWLv2, OpenCV 5’s native engine is competitive with, and often faster than, a mature and heavily optimized runtime, all while keeping everything inside a single dependency. A comprehensive benchmark can be found at OpenCV5 DNN Benchmark.

Real-time RF-DETR detection running entirely through the new DNN engine.
Better ONNX coverage stays abstract until you see the list of models it unlocks. OpenCV 5 has been validated against a broad, modern lineup spanning detection, segmentation, backbones, and generative models:

If your project depends on any of these, OpenCV 5 means one fewer framework in your dependency list.
This one still surprises people. OpenCV 5 can run large language models and vision-language models directly inside the DNN module, with no separate runtime.
To make that work, OpenCV 5 ships two things that classic CV libraries never needed: a built-in tokenizer and a KV-cache.
These work across Qwen 2.5, Gemma 3, PaliGemma, and the GPT-2 / GPT-4 family, all through the same Net API you already use for a YOLO model. Vision-language pipelines (image in, text out) are supported through models like PaliGemma.
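Text generation leans heavily on a key/value cache, so it is worth seeing why one matters. The toy below counts how many per-token projections are computed with and without a cache. The `project` function is a made-up stand-in for a real model's key/value projection, and nothing here reflects OpenCV's actual implementation.

```python
def project(token):
    """Stand-in for the key/value projection of one token (hypothetical)."""
    project.calls += 1
    return (token * 2, token * 3)  # (key, value)


project.calls = 0


def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    kv = []
    for step in range(1, len(tokens) + 1):
        kv = [project(t) for t in tokens[:step]]
    return kv


def decode_with_cache(tokens):
    """Project only the newest token and append it to the cache."""
    cache = []
    for token in tokens:
        cache.append(project(token))
    return cache


tokens = [3, 1, 4, 1, 5]

project.calls = 0
full = decode_without_cache(tokens)
calls_without = project.calls  # 1 + 2 + 3 + 4 + 5

project.calls = 0
cached = decode_with_cache(tokens)
calls_with = project.calls     # one per token
```

Both paths end with the same keys and values, but the uncached version does quadratic work in the sequence length while the cached one stays linear.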

In the team’s tests, asking Qwen 2.5 “What is OpenCV?” through OpenCV’s engine produced output that matched ONNX Runtime token for token. That is a reassuring sign that correctness was not traded away for the convenience of keeping everything in one library.
Will OpenCV replace a dedicated LLM serving stack for a production chatbot? No, and it does not aim to. What it gives you is a vision pipeline that can reach for a small language or vision-language model for tasks like captioning, OCR post-processing, or open-vocabulary queries, without bolting on a whole separate framework.
One of the most satisfying demos in the release is object removal with LaMa, running entirely inside the new DNN engine. You give it an image and a mask of what to remove, and it fills the hole back in with blended edges and no external runtime.

Mask in, clean image out. LaMa inpainting in a single forward pass.
The flow is as simple as it sounds:
And the code really is this short:
```python
import cv2 as cv

# img: the input image, mask: the region to remove (both loaded beforehand).
net = cv.dnn.readNetFromONNX("lama.onnx")
blob = cv.dnn.blobFromImages([img, mask], scalefactor=1/255.)
net.setInput(blob)
out = net.forward()  # inpainted image
```
A ready-to-run version lives at samples/dnn/inpainting.py in the OpenCV 5.x branch, and there is a diffusion-based inpainting sample (samples/dnn/ldm_inpainting.py) if you want to go further.
Feature detection and matching is one of OpenCV’s oldest jobs. It powers panorama stitching, image alignment, and a lot of 3D reconstruction. For years that meant SIFT and ORB. OpenCV 5 brings the modern, learned approach into the library as a first-class citizen.
The new Features module, which replaces Features2D, adds a complete neural pipeline for detection, description, and matching:
The classic detectors (SIFT, ORB, FAST, GFTT, MSER) are still here, and the more obscure ones moved to opencv_contrib. So you can adopt the deep learning pipeline where it helps and keep the tried-and-true methods where they are enough.
The stitching module itself gets the upgrade through a new LightGlueFeaturesMatcher, so the classic task of stitching photos into a panorama now benefits from learned matching with no extra glue code on your side.
Not everything in OpenCV 5 is about deep learning. The core that everything else builds on got a serious tune-up, and the benefits show up even in plain image processing code.
New data types. OpenCV 5 adds first-class FP16 (cv::hfloat, CV_16F) and BF16 (cv::bfloat, CV_16BF) types, plus bool, 64-bit integers, and more. These are the data types modern AI workloads use, so having them native means less converting back and forth.
Real N-dimensional and scalar support. cv::Mat can now represent 0D (scalar) and 1D arrays, which tripped people up for years because the old Mat always wanted at least two dimensions. Add broadcasting and first-class N-D operations like transposeND and flipND, and a lot of awkward reshaping code goes away.
Better performance. The team reports up to 2x improvements on mathematical workloads, and the same code now runs across CPUs and accelerators without modification.

Broadcasting comes to OpenCV’s core.
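If broadcasting is new to you, the shape rule is easy to state in code. The sketch below assumes OpenCV follows the familiar NumPy-style convention (shapes are right-aligned and a dimension of 1 stretches to match); the post does not spell out the exact rules, so treat `broadcast_shape` as an illustration rather than a description of the OpenCV API.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Result shape of broadcasting shapes a and b, NumPy-style (right-aligned)."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {a} with {b}")
        out.append(y if x == 1 else x)  # a size-1 dimension stretches to the other size
    return tuple(reversed(out))


image_plus_bias = broadcast_shape((480, 640, 3), (3,))  # per-channel offset
outer = broadcast_shape((3, 1), (1, 4))                 # column times row
scalar = broadcast_shape((), (5,))                      # 0D scalar with a 1D array
```

The last case is the one that the new 0D and 1D support makes expressible at all: a true scalar has the empty shape `()`.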
On the language side, the cleanup is just as welcome:
These are the kinds of changes you stop noticing after a week, precisely because they take daily friction away.
One of the quieter but most impactful ideas in OpenCV 5 is a redesigned Hardware Acceleration Layer (HAL).
In the old world, supporting different chips meant scattered conditional code and a lot of duplicated effort. In OpenCV 5, every core function routes through a single, clean HAL contract. Hardware vendors can plug in tuned kernels for their chips, and OpenCV uses them automatically when they are available. Your code does not change; it simply runs faster on supported hardware.
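The "single contract with optional tuned kernels" idea can be sketched as a tiny dispatch table. This is a deliberately simplified Python illustration with invented names (`register_kernel`, `hal_dispatch`); the real HAL is a C/C++ mechanism and works differently in its details.

```python
_KERNELS = {}  # function name -> vendor-tuned implementation


def register_kernel(name, impl):
    """A vendor plugs in a tuned kernel under a well-known function name."""
    _KERNELS[name] = impl


def hal_dispatch(name, fallback, *args):
    """Use the tuned kernel if one is registered, otherwise the generic code path."""
    return _KERNELS.get(name, fallback)(*args)


def generic_add(a, b):
    return [x + y for x, y in zip(a, b)]


def add(a, b):
    # Caller-facing function: its signature never changes.
    return hal_dispatch("add", generic_add, a, b)


calls = []


def tuned_add(a, b):
    calls.append("tuned")  # record that the vendor path was taken
    return [x + y for x, y in zip(a, b)]


before = add([1, 2], [3, 4])   # generic path
register_kernel("add", tuned_add)
after = add([1, 2], [3, 4])    # tuned path, same call site
```

The call site is identical before and after registration, which is the property the paragraph above describes: the application code stays the same and only the implementation behind it changes.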

Several vendor-tuned paths are already wired in:
Underneath all of this sits Universal Intrinsics 2.0, a single vector codebase that maps to SSE, AVX2/512, NEON, SVE, RVV, and more. It is mostly an internal detail, but it is the reason one implementation can target so many architectures, and the team reports 3-4x speedups on common ARM operations like resizing and warping because of it.
For you as a developer, the result is simple. Write OpenCV code once, and it uses the best path on whatever hardware it lands on, from a cloud ARM server to a phone to a RISC-V board.
OpenCV’s 3D capabilities have grown a lot over the years, and the old monolithic calib3d module had become something of a junk drawer. OpenCV 5 splits it into three focused modules:
The highlights developers will feel most:
If you work on structure-from-motion, robotics, or any kind of reconstruction, this is a meaningful upgrade rather than a cosmetic reshuffle.
A small but welcome change: the docs have been rebuilt. OpenCV moved from plain Doxygen to a Sphinx + Doxygen pipeline. That brings a persistent left-hand navigation pane, hand-written tutorials sitting right next to the auto-generated API reference, Python signatures shown alongside C++, a link checker in pre-commit, and modern styling.

It sounds minor until you have spent an afternoon hunting for a function signature in the old layout. This is the kind of quality-of-life change that makes the whole library feel more approachable.
If you skim only one section, make it this one. Here is the at-a-glance comparison of OpenCV 5.0 against the 4.x line:
| Feature | OpenCV 5.0 | Compared to 4.x |
|---|---|---|
| ONNX coverage | 80%+ of operators | ~22%, a major jump |
| DNN engine | Rewritten graph engine with fusion and a buffer pool | No modern memory pooling or dynamic shapes |
| Optional backends | Drop-in ONNX Runtime via ENGINE_ORT | Native backend only |
| LLM / VLM support | Built-in tokenizer + KV-cache for Qwen / Gemma / GPT | Not available |
| Dynamic-shape ONNX | Native, via the new shape-inference engine | Brittle, with known bugs |
| 0D / 1D tensors | cv::Mat supports scalars and 1D arrays | Mat required at least 2D |
| Data types | Adds FP16, BF16, bool, 64-bit ints, and more | FP32, INT8, UINT8 mainly |
Target release: June 2026, timed with CVPR 2026 in Denver.
OpenCV 5.0 is a big release, and it is also a foundation. Some of the most interesting pieces are deliberately built into the architecture now so they can be filled in over the 5.x cycle. Two of them stand out, and if you run heavy pipelines, they are the ones to watch.
You may have noticed that every benchmark earlier in this post ran on a CPU. That is not an accident. The new graph engine landed CPU-first, and got that right, before chasing accelerators. Today, if you want GPU inference, you either select the classic engine with a non-CPU backend and target such as CUDA, or reach for the ONNX Runtime backend (ENGINE_ORT) and its execution providers like CUDA and TensorRT. Both are perfectly good options.
The roadmap is to bring GPU acceleration to the native engine itself. The whole point of the graph-based design, with its typed ops, shape inference, fusion, and unified buffer pool, is that it gives the engine the information it needs to schedule work on a GPU intelligently rather than only on a CPU. As that support matures, you should be able to get GPU speed straight from OpenCV’s own engine, using the same Net API and without pulling in a separate runtime.
This is the quieter half of the story, and arguably the more important one for real pipelines. The new HAL was designed with two paths: the CPU HAL described earlier, and a new non-CPU HAL that exposes the same surface area for GPUs, NPUs, and other accelerators.
Why does that matter? Because in most vision pipelines the model is only part of the work. Around every forward() call there is a stack of image operations: resize, color conversion, normalization, and letterboxing before the model, then non-maximum suppression, mask resizing, and overlay drawing after it. Today a lot of that pre- and post-processing runs on the CPU, which means the data gets copied to the accelerator for inference and back to the CPU for the surrounding steps. Those round trips are often the real bottleneck, not the model itself.
The goal of the non-CPU HAL is to let those everyday imgproc functions run on the same accelerator as the model, so the data can stay put. Keep the frame on the GPU, do the resize and normalize there, run inference there, do the post-processing there, and bring back only the final result. For high-throughput or real-time workloads, removing those copies can matter as much as a faster model.
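A back-of-the-envelope model makes the copy cost visible. The sketch below sums the bytes that cross a device boundary in two versions of the same pipeline. Every number in it is an illustrative assumption (a 1080p frame, a 640x640 float32 input, a made-up raw output size), not a measurement of OpenCV.

```python
def bytes_copied(stages):
    """stages: list of (name, device, output_bytes) in execution order.

    A copy happens whenever a stage's output is consumed on a different device.
    """
    total = 0
    for (_, device, out_bytes), (_, next_device, _) in zip(stages, stages[1:]):
        if device != next_device:
            total += out_bytes
    return total


FRAME = 1920 * 1080 * 3   # uint8 BGR frame
BLOB = 640 * 640 * 3 * 4  # float32 network input
RAW = 7_000_000           # raw network output, illustrative size
BOXES = 100 * 6 * 4       # final detections

mixed = [
    ("capture", "cpu", FRAME),
    ("resize+normalize", "cpu", BLOB),
    ("inference", "gpu", RAW),
    ("nms+masks", "cpu", BOXES),
    ("consume", "cpu", 0),
]
on_device = [
    ("capture", "cpu", FRAME),
    ("resize+normalize", "gpu", BLOB),
    ("inference", "gpu", RAW),
    ("nms+masks", "gpu", BOXES),
    ("consume", "cpu", 0),
]
mixed_bytes = bytes_copied(mixed)
on_device_bytes = bytes_copied(on_device)
```

With these assumed sizes the on-device pipeline moves roughly half as many bytes, and nearly all of what remains is the unavoidable frame upload. The actual saving depends entirely on your tensor sizes and on where the frame originates.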

Together, these two efforts point at the same destination: writing ordinary OpenCV code and having the whole pipeline, not only the model, run on whatever hardware you have, transparently. The plumbing is in place in 5.0, and the rest will arrive across the 5.x series.
OpenCV is a community project, and this release is a good moment to jump in.
The team is gathering feedback during the 5.x cycle, and that feedback shapes the final release. If you try it and something breaks, or something delights you, say so.
OpenCV 5 is big in both senses: big in scope, and big in the day-to-day difference it makes. The rewritten DNN engine alone closes the most common source of frustration with the library, pushing ONNX coverage past 80% and adding dynamic shapes, fusion, and built-in LLM and VLM support. Around it sit a faster, more modern core, an API cleanup that removes years of friction, transparent hardware acceleration through the new HAL, and a much-improved 3D vision toolkit.
What stands out is the restraint. Three DNN engines sit behind one unchanged API, the classic detectors stay alongside the new neural ones, and the old engine is preserved for compatibility. OpenCV 5 modernizes aggressively without leaving its huge existing user base behind.
If you have kept a modern model on the shelf because OpenCV could not load it, this is the release to revisit. Grab the 5.x branch, point it at that model, and see how far things have come.