Maybe a Raspberry Pi 4 too.
But this goes beyond what even Apple has, by making it possible to work directly with compressed lossless video on consumer GPUs. You can get hundreds of FPS encoding or decoding 4K 16-bit FFV1 on a 4080, while only reading a few gigabits of video per second, rather than the tens or even hundreds of gigabits per second that SSDs can't keep up with. No need for image degradation when passing intermediate copies between CG programs and editors, either.
I don't know much about video compression; does that mean that a codec like H.264 is not parallelizable?
One only needs to look at GPU driven rendering and ray tracing in shaders to deduce that shader cores and memory subsystems these days have become flexible enough to do work besides lock-step uniform parallelism where the only difference was the thread ID.
Nobody strives for random access memory read patterns, but the universal popularity of buffer device address and descriptor arrays can be taken somewhat as proof that these indirections are no longer the friction for GPU architectures that they were ten years ago.
At the same time, the languages are no longer as restrictive as they once were. People are recording commands on the GPU. This kind of fiddly serial work is an indication that the ergonomics of CPU programming have less of a relative advantage, and that cuts deeply into the tradeoff costs.
It is usually more reasonable to work with software decoders for really complex formats, or only to accelerate some heavy parts of the decoding where data corruption is really easy to deal with or benign, or aim for the middle ground: _SIMPLE_ and _VERY CONSERVATIVE_ compute shaders.
Sometimes the software cannot even tell that the hardware has actually 'crashed' and is spitting out nonsense data. It gets even worse: some hardware blocks' hot reset doesn't actually work and requires a full power cycle... So any 'media player' able to use hardware decoding must always provide a clear and visible 'user button' to let the user switch to full software decoding.
Then there is the next step of "corruption": some streams out there are "wrong", but that "wrong" decodes fine on some specific decoders and not on others, even though all of them follow the same spec.
What a mess.
I hope those compute shaders are not using that abomination GLSL (or the DX one), but are instead SPIR-V shaders generated from plain and simple C code.
IMO vendors should stop reinventing hardware video encoding and instead assign the programmer time to making libwebrtc and libvpx better suit their particular use case.
This is especially a choke point when you use these codecs at high quality settings. The prediction and filtering steps later in the decoding pipeline are comparatively easy to parallelize.
High-throughput codecs like ProRes don't use arithmetic coding but a much simpler, table-based coding scheme.
P.S.: In video decoding, speed is only relevant up to a certain point, that being: "Can I decode the next frame(s) in time to show it/them without stuttering?" Once that has been achieved, other factors such as power draw become more important.
The reason to create image sequences is not because you need to send it to other apps, it’s because you preserve quality and safeguard from crashes.
A crash mid video write out can corrupt a lengthy render. With image sequences you only lose the current frame.
People aren’t going to stop using image sequences even if they stayed in the same app.
And I’m not sure why this applies: “this goes beyond” what Apple has, because they do have hardware support for decoding several compressed codecs (also I’ll note that ProRes is also compressed). Other than streaming, when are you going to need that kind of encode performance? Or what other codecs are you expecting will suddenly pop up by not requiring ASICs?
Also how does this remove degradation when going between apps? Are you envisioning this enables Blender to stream to an NLE without first writing a file to disk?
It depends on what you're going for. If you're trying to do the absolute highest fidelity for archiving a blu-ray disk, AMD Epyc reigns supreme. That's because you need a lot of flexibility to really dial in the quality settings. Pirates over at PassThePopcorn obsess over minute differences in quality that I absolutely cannot notice with my eyes, and I'm glad they do! Their encodings look gorgeous. This quality can't be achieved with the silicon of hardware-accelerated encoders, and due to driver limitations (not silicon limitations) also cannot be achieved by CUDA cores / execution engines / etc on GPUs.
But if you're okay with a small amount of quality loss, the optimum move for highest # of simultaneous encodes or fastest FPS encoding is to skip the CPU and GPU "general compute" entirely - going with hardware accelerated encoding can get you 8-30 1080p simultaneous encodes on a very cheap intel iGPU using QSV/VAAPI encoding. This means using special sections of silicon whose sole purpose is to perform H264/H265/etc encoding, or cropping / scaling / color adjustments ... the "hardware accelerators" I'm talking about are generally present in the CPU/iGPU/GPU/SOC, but are not general purpose - they can't be used for CUDA/ROCm/etc. Either they're being used for your video pipeline specifically, or they're not being used at all.
I'm doing this now for my startup and we've tuned it so it uses 0% of the CPU and 0% of the Render/3D engine of the iGPU (which is the most "general purpose" section of the GPU, leaving those completely free for ML models) and only utilizing the Video Engine and Video Enhance engines.
For something like Frigate NVR, that's perfect. You can support a large # of cameras on cheap hardware and your encoding/streaming tasks don't load any silicon used for YOLO, other than adding to overall thermal limits.
Video encoding is a very deep topic. You need to have benchmarks, you need to understand not just "CPU vs GPU" ... but down to which parts of the GPU you're using. There's an incredible amount of optimization you can do for your specific task if you take the time to truly understand the systems level of your video pipeline.
I haven't actually looked into this, but it might well be within the realm of possibility. You are generating a frame on the GPU; if you can also encode it there (whether with NVENC or Vulkan doesn't matter), you can then DMA the result to the NIC, using the CPU only to process the packet headers, assuming that cannot also be handled by the GPU/NIC.
You wouldn't contain FFv1 in MP4, the only format incompetent enough for such corruption.
Apple has an interest against people using codecs that they get no fees from, and Apple doesn't have a lossless codec. So they don't offer lossless compressed video acceleration.
The idea is that when working as part of a team and you get handed a CG render, you can avoid sending a huge .tar or .zip file full of TIFFs which you then decompress, or ProRes which loses quality, particularly in a linear colorspace like ACEScg.
I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.
> If you're trying to do the absolute highest fidelity for archiving a blu-ray disk, AMD Epyc reigns supreme.
You don't need any special CPU to get the highest fidelity as long as you're willing to wait. For archiving purposes any CPU will do, just be prepared to let it run for a long time.
Another reason to use image sequences is that it's easier to re-render just a portion of the sequence. Granted, this can be done with video too, but with higher overhead.
But even then, why does GPU encoding change the fact that you'd send it to another NLE? I just feel like there are a lot of jumps in the thought process here.
Correct, but Epyc "reigns supreme" for anyone caring about performance / total FPS throughput, which is relevant for anyone who cares about TFA at all - the purpose of using GPU is to "go faster", and that's what Epyc offers for use cases that also care about extreme fidelity.
> I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.
Sure. It absolutely depends on your use case. We're using it for RDP/KVM-type video, so for us the quality loss is indeed quite "small". Our users care more about "can I read the text clearly?" and less about color-banding. The hardware accelerators do a great job with text clarity so for our use-case it's not much of a noticeable quality loss. I will admit the colors are very noticeably distorted.
Using 0% of the CPU and GPU for encoding is a HUGE win that's totally worth it for us - hardware costs stay super low. Using really old, bottom-of-the-barrel CPUs for 30+ simultaneous encodes feels like cheating. Hardware-accelerated encoding also provides another massive win by tangibly reducing latency for our users vs CPU/GPU encoding (it's not just throughput that improves; each live frame gets through the pipeline faster too).
I wouldn't use COTS hardware accelerators for archiving Blu-Ray videos. Hell I'm not even aware of any COTS hardware accelerators that support HDR ... they probably exist but I've never stumbled across one. But hardware-accelerated encoding really is ideal for a lot of other stuff, especially when you care about CapEx at scale. If you're at the scale of Netflix or YouTube, you can get custom silicon made that can provide ASIC acceleration for any quality you like. That said, they seem to choose to degrade video quality to save money all the way to the point that 10-20% of their users hate the quality (myself included, quality is one of the primary reasons I use PassThePopcorn instead of the legal streaming services), but that's a business choice, not a technical limitation of ASIC acceleration (if you have the scale to pay for custom silicon, COTS solutions absolutely DO have a noticeable quality loss, as you argue).
prores decodes faster than realtime single threaded on a decade old CPU too
it doesn't make sense. it's much different with, say, a video game, where a texture is loaded once into VRAM, and then yes, all the work is done on the GPU. a video has CPU IO every frame; you are still doing a ton of CPU work. i don't know why people are talking about power efficiency; in a pro editing context, your CPU will be very, very busy with these IO threads, including and especially in ffmpeg with hardware encoding/decoding, no less. it doesn't look anything like a video game workload, which is what this stack is designed for.
This is a perfect use case for hardware video acceleration.
The hardware encoder blocks are great for anything live-streaming related. The video they produce uses a much higher bitrate and has lower quality than what you could get from a CPU encoder, but if doing a lot of real-time encodes is important, they deliver.
GPUs these days have massive caches, often hundreds of megabytes, on top of an already absurd number of registers. A random read will often load a full cache line into a register and keep it there, reusing it as needed between invocations.
The bitstream reader in FFmpeg for Vulkan Compute codecs is copied from the C code, along with bounds checking. The code which validates whether a block is corrupt or decodable is also taken from the C version. To date, I've never got a GPU hang while using the Compute codecs.
But yes, in the sense that a collection of invocations all progressing in lockstep have their arithmetic done by vector units. GPUs have just gotten really good at hiding what happens when paths branch between invocations.
The critical difference is that SIMD and parallel programming are totally different in terms of ergonomics while SIMT is almost exactly the same as parallel programming. You have to design for SIMD and parallelism separately while SIMT and parallelism are essentially the same skill set.
The fan-in / fan-out and iteration rotation are the key skills for SIMT.
First, the original code was reverse-engineered before Apple published an SMPTE document describing the bitstream syntax. Second, I tried my best at optimizing the code for GPU hardware. And finally, I wanted to take the learning opportunity :)
And to answer the parent's question, the shaders are written in pure GLSL. For instance, this is the ProRes bitstream decoder in question: https://code.ffmpeg.org/FFmpeg/FFmpeg/src/branch/master/liba...
Video encoding and decoding on the internet is largely a solved problem for everyday users. Most consumer devices now ship with dedicated hardware accelerator chips, to which APIs like the Vulkan® Video extensions provide direct access. Meanwhile, newer codecs are increasingly royalty-free with open specifications — or simply age out of licensing restrictions — making the standards accessible to everyone.
It's easy to forget how demanding 720p H.264 decoding was on CPUs just 18 years ago. That challenge drove intense competition and optimization among software implementations, pushing performance to the limit until hardware decoding finally became commonplace.
In professional workflows, however, performance walls still exist. Editors scrubbing through days of raw camera footage, colorists working with 8K 16-bit masters, VFX artists rendering 32-bit floating-point ACEScg video, and archivists handling extreme-resolution lossless film scans are still performance-bound. Where casual users once tolerated the occasional frame drop, today's professionals are often pushed toward expensive proprietary solutions or liquid-cooled, hundred-core workstations with hundreds of gigabytes of RAM.
This post explores how FFmpeg uses Vulkan Compute to seamlessly accelerate encoding and decoding of even professional-grade video on consumer GPUs — unlocking GPU compute parallelism at scale, without specialized hardware. This approach complements Vulkan Video's fixed-function codec support, extending acceleration to formats and workflows it doesn't cover.
Codecs are algorithms that exploit redundancy and patterns in a signal to compress it for storage or transmission. How easy is it to parallelize codec processing on a GPU?
Take JPEG, the C. elegans of compression codecs, as an illustrative example. Encoding an image requires a 2D frequency transform (partially parallelizable, processing rows then columns), DC value prediction (fully serial), quantization to discard perceptually irrelevant information (fully parallel), and finally Huffman coding (extremely serial). The mix of parallel and serial steps turns out to be the central challenge for GPU codec acceleration. Decoding reverses these steps — but the serial bottlenecks remain just as problematic.
This is the fundamental tension: codec pipelines are riddled with serial dependencies, while GPUs are purpose-built to execute thousands of independent, uncorrelated operations simultaneously.
The historically obvious approach was hybrid decoding: handle the serial steps (like coefficient decoding) on the CPU, upload intermediate results to the GPU, then let the GPU run the parallel steps where it excels.
In practice, this runs into a fundamental problem: GPUs are physically distant from system memory. Even with DMA and high-bandwidth transfers, the round-trip latency often makes hybrid decoding slower than just doing the parallel steps on the CPU — especially given how capable modern SIMD-enabled CPUs have become.
Real-world results with hybrid codec implementations have confirmed this. The dav1d decoder attempted to offload its final filter pass — complex but highly parallelizable — to the GPU, but saw no gain over the CPU, even on mobile. x264 added basic OpenCL™ support, but frame upload latency killed any performance advantage, and the code eventually bitrotted.
These failures have left hybrid implementations with a poor reputation in the multimedia community. The lesson is clear: to be consistently fast, maintainable, and widely adopted, compute-based codec implementations need to be fully GPU-resident — no CPU hand-offs.
Most codecs are designed with ASIC hardware in mind — the dedicated video engines found on modern GPUs and exposed through Vulkan Video. But even ASICs aren't infinitely fast: codecs typically compromise and define a minimum unit of parallelizable work, called a slice or block, representing the smallest chunk that can be processed independently.
Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.
Together, these trends make it genuinely feasible today to implement certain codecs entirely in compute shaders — no CPU involvement required.
Compute-based encoders also have an advantage over ASICs that's easy to overlook: they're unconstrained in memory usage and search time. With enough threads to exhaustively scan each block, matching or even surpassing the quality of software encoders is entirely achievable.
FFmpeg is a free and open source collection of libraries and tools to enable working with multimedia streams, regardless of format or codec. Whilst famous for its codec implementations with handwritten assembly optimizations across multiple platforms, FFmpeg also provides easy access to hardware accelerators.
Crucially, hardware acceleration in FFmpeg is built on top of the software codecs. Parsing of headers, threading, scheduling of frames and slices, and error correction/handling all happen in software; only the decoding of the video data itself is offloaded. This combines robust, well-tested code with hardware acceleration. The frame-level threading of software implementations translates directly: we dispatch multiple independent frames for parallel decoding to fully saturate a GPU.
It also allows users to switch between software and hardware implementations dynamically via a toggle, regardless of whether hardware decoding is implemented using Vulkan Video or Vulkan Compute shaders.
The widespread use of FFmpeg in editing software, media players, and browsers, combined with the ability to add hardware-accelerator support to any software implementation, makes it an ideal starting point for making compute-based codec implementations widely accessible, rather than confining them to dedicated libraries.
FFv1, the FFmpeg Video Codec 1, has become a staple of the archival community and of applications where lossless compression is required. It's open, royalty-free, and an official IETF standard.
The work of implementing codecs in compute shaders in FFmpeg began here. The FFv1 encoder and decoder are very slow on a CPU, despite the format supporting up to 1024 slices. This is partly due to the huge bandwidth needed for high-resolution RGB video, and partly due to the somewhat bottlenecked entropy coding design.
FFv1 version 3 was designed over 10 years ago, and it was thanks to the archival community, who adopted it, that it gained wide usage. However, the bottlenecks were making encoding and decoding of high resolution archival film scans prohibitively time consuming.
Thus, thanks to that community, the compute-based FFv1 encoder and decoder were written. They started out as conversions of the software encoder and decoder, but were gradually optimized further with GPU-specific functions.
The biggest challenge when encoding FFv1 is the range coder, which lacks the optimizations that, for example, AV1's range coder has. Each bit of every symbol (a pixel difference value) has its own 8-bit adaptation value, so coding a symbol requires looking up 32 contiguous values at a random offset within a set of thousands (per plane!). We speed this up with a workgroup size of 32: each local invocation performs its lookup and adaptation in parallel, while a single invocation performs the actual encoding or decoding.

For RGB, a Reversible Color Transform (RCT) is performed to decorrelate pixel values further. Originally, a separate shader did this, encoding to a separate image. However, for very high resolution images, the bandwidth required outweighed the advantages. Since only 2 lines are needed to decode or encode, we instead allocate width*horizontal_slices*2 images and perform the RCT ahead of coding each line, with the help of the 32 helper invocations.
APV is a new codec designed by Samsung to serve as a royalty-free, open alternative for mezzanine video compression. Recently, it too became an IETF standard. It's gaining traction with the VFX and professional media production communities, as well as a camera recording format in smartphones.
Unlike most codecs mentioned in this article, APV was designed for parallelism from the ground up. Similar to JPEG, each frame is subdivided into components, and each component is subdivided into tiles, with each tile featuring multiple blocks. Each block is simply transformed, quantized via a scalar quantizer (simple division), and encoded via variable length codes. There is not even any DC prediction.
To implement it as compute shaders, we first handle the decoding of each tile in one shader, then run a second shader which transforms a single block row per invocation.
ProRes is the de facto standard mezzanine codec, used for editing, camera footage, and mastering. It's a relatively simple codec, similar to JPEG and APV, which made it possible to implement a decoder and, due to popular demand, an encoder.
For decoding, we do essentially the same process as with APV. For encoding however, we do proper rate control and estimation by running a shader to find which quantizer makes a block fit within the frame’s bit budget.
Unfortunately, unlike the other codecs on this list, the ProRes codecs are neither royalty-free nor openly specified. The implementations in FFmpeg are unofficial, but due to their sheer popularity, such implementations are necessary for interoperability with much of the professional world. Nevertheless, the developers dogfood the implementations, and their output is monitored to match the official ones.
ProRes RAW features a bitstream that shares little with ProRes, because it was made for compressing RAW (non-debayered) lossy sensor data. It uses a DCT performed on each component, and a coefficient coder which predicts DCs across components and efficiently encodes AC values from multiple components in the usual zigzag order. The entropy coding system is not exactly a traditional variable length code, but closer to exponential coding. Slices contain multiple blocks, and each component can be decoded in parallel. Unlike FFv1, there is no limit on the number of tiles per image, which can require decoding hundreds of thousands of independent blocks. This is great for parallelism, leading to efficient implementations.
The decoder was implemented in a 2-pass approach, with the first shader decoding each tile, and the second shader transforming all blocks within each tile with row/column parallelism (referred to as shred configuration due to being able to fully saturate a GPU's workgroup size limit).
DPX is not a codec, but rather a packed-pixel container with a header. It's an official SMPTE standard, and rather popular with film scanners. Rather than being optimally laid out with tightly packed pixels, it can pack pixels into 32-bit chunks, padding if needed. Or it can... not pack pixels at all, depending on a header switch.
Being an uncompressed format with loose rules, made decades ago, it's rife with vendors interpreting the specification rather creatively, in ways that completely break decoding. Thankfully, there's a text "producer" field in the header for such implementations to sign their artistry with, which can be used to figure out how to unpack correctly without seeing alien rainbows.
All of this comes down to just writing heuristics in shaders. The overhead is never the calculation needed to locate a collection of pixels, but actually pulling data from memory and writing it elsewhere.
VC-2 is another mezzanine codec. Authored by the BBC, based on its Dirac codec, it is royalty-free, with official SMPTE specifications. Its primary use-case was real-time streaming, particularly fitting high resolution video over a gigabit connection with sub-frame latency. Unlike APV or ProRes, it is based on wavelet transforms. Each frame is subdivided into power-of-two sized slices.
Wavelets are rather interesting as transforms. They subdivide a frame into a quarter-resolution image, and 3 more quarter-resolution images as residuals. Unlike DCTs, they are highly localized, which means they can be performed individually on each slice, yet when assembled they function as if the entire frame was transformed. This eliminates blocking artifacts that all DCT-based codecs suffer from.
This also means they're less efficient to encode as their frequency decomposition is compromised. Also, their distortion characteristics are substantially less visually appealing than the blurring of DCTs. This was one of the main reasons they failed to gain traction in post-2000s codecs.
The resulting coefficients are encoded via simple interleaved Golomb-exp codes, which, while not parallelizable, can be beautifully simplified in a decoder to remove all bit-parsing and instead operate on whole bytes.
The codec given as an example at the start turns out to admit a very interesting attack that opens the door not just to parallelization, but to parallelizing arbitrary data compression standards such as DEFLATE.
The idea is that although VLC streams offer no explicit way to parallelize, VLC decoders (in fact, decoders of all codes that satisfy the Kraft–McMillan inequality) spuriously resynchronize: after a surprisingly short delay, a decoder started at the wrong bit offset tends to begin outputting valid data.
All that's needed is to run 4 shaders that gradually synchronize the starting points within each JPEG stream. JPEG has multiple variants too, such as the progressive and lossless profiles, which can be parallelized to the same extent.
DC prediction can be done via a parallel prefix sum, which is amongst the most common operations done via compute shaders. DCTs can be done via a shred configuration, as with other codecs.
With the release of FFmpeg 8.1, we've implemented FFv1 encoding and decoding, ProRes encoding and decoding, ProRes RAW decoding, and DPX unpacking. GPU-based processing is automatically enabled and used if Vulkan-accelerated decoding is enabled.
The VC-2 encoder and decoder, along with the JPEG and APV decoders, are still in progress and need additional work before they can be merged.
Looking further ahead, the only remaining codecs with meaningful GPU acceleration potential are JPEG2000 and PNG — the rest either have limited practical use cases or don't benefit from compute-based acceleration.
Unfortunately, JPEG2000 — and by extension JPEG2000HT — is unlike most modern codecs, burdened with the worst features of several combined: a semi-serialized coding system that requires extensive domain knowledge and a bitstream complex enough to give most modern bureaucracies pause. Software decoding of JPEG2000 ranks among the slowest of all widely-used codecs, owing to its ASIC-centric design and under-engineered arithmetic coder. Despite all this, it remains the primary codec used in digital cinema, medicine, and forensics.
PNG acceleration is an open question: its viability as a GPU target will depend on how effectively DEFLATE can be parallelized.
Vulkan is often pigeonholed as a graphics API with added compute — but that framing is outdated. Its compute capabilities have evolved to match, and in some cases exceed, dedicated compute APIs. Modern Vulkan offers pointers, extensive subgroup operations, shared memory aliasing, native bitwise operations, a well-defined memory model, shader specialization, 64-bit addressing, and direct access to GPU matrix units. Together, these features enable programmers to optimize at a lower level than more abstracted APIs.
Even so, the Vulkan Compute API has not yet reached its full potential, as it doesn't yet expose the full capabilities of SPIR-V™, which as an intermediate representation is remarkably expressive. Support for the broader SPIR-V feature set is actively expanding — untyped pointers and 64-bit addressing are already available, and support for bitwise operations on non-32-bit integer types is on the way.
Competing compute APIs from GPU vendors often bundle hundreds of specialized and specifically optimized algorithm implementations, accessible through more comfortable programming languages — a tempting package. The catch, of course, is vendor lock-in, which can be a serious concern for portable, long-lived software like FFmpeg.
FFmpeg is no stranger to writing its own implementations of popular algorithms to avoid dependencies, such as hashing functions, sorting algorithms, CRCs, and frequency transforms. But on the other hand, are extensive, object-oriented APIs actually necessary? Often, formatting data for a common implementation takes longer and produces less optimal code than simply writing a small implementation of the algorithm, specialized for the use-case at hand. OOP can in many cases be handled by simply templating via a preprocessor. Linking multiple pieces of code can be a simple #include. And fragile code that targets a single version of a vendor's API, which in turn depends on a specific old GCC version, can be replaced by a reliable, lasting, self-sufficient shader.
Vulkan is ubiquitous — from tiny SoCs, to tablets, embedded GPUs, discrete GPUs, and professional server GPUs — and its industry-led governance model creates strong incentives to support new extensions broadly. Constant automated testing is performed using a comprehensive conformance test suite. Lastly, Vulkan enjoys a broad ecosystem of debugging, optimization, and profiling tools, and its large global developer community means that almost any GPU quirk or optimization trick you discover has already been found, documented, and fed back into the specification.
Whether using Vulkan Video or Vulkan compute shaders, Vulkan has become a compelling API to access GPU-accelerated video processing.
FFmpeg download: https://ffmpeg.org/download.html
Khronos® and Vulkan® are registered trademarks, and SPIR-V™ is a trademark of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used under license by Khronos. All other product names, trademarks, and/or company names are used solely for identification and belong to their respective owners.