I have two favorite bug stories. The first is from a printout from the run of an IBM 360 assembly language program when I was just learning. Someone asked me why their program failed to run. I glanced quickly at the front page of the printout and it said "Too Long". So I told the person that was the problem. Something was too long. He looked at me very strangely, so I looked back at the page a little more closely, only to notice "Too Long" was in the name field of the person running the program. He was Vietnamese and his name was Too Long - literally. There is a powerful lesson (at least one) there.
The other happened when I was implementing some AppleTalk protocols - NBP to be exact. (Don't ask). I would capture the working packets, then compare all the checksums, headers, constants, and length fields against the packet my code generated and fix any problems. I was stuck on one failure. I just could not see any difference as I went through byte by byte, time after time. It was late and time to go home, so I decided to print off each packet on paper and compare them later - certain I was missing something. The problem was instantly obvious. One printout took a page, the other two pages. I had been appending junk data to the packet. Sigh.
It is impossible to parse the UDP or TCP port number out of a non-initial fragment. This is surely the reason the ACL module entirely rejects them. TCP will adjust its segment size based on PMTUD so as to not require fragmentation. This is why it hasn't been noticed so far. But fragmented UDP packets are a corner case of normal behavior and it boggles the mind that someone could just decide to completely drop them.
UDP fragment filtering could be implemented by a global fragments on/off setting (works for "allow everything" = fragments on, cautious = fragments off) or by matching on the first fragment, which includes the port number (and blocking it if the port number is split across fragments, which I think is technically allowed but completely abnormal).
If https://github.com/pion/sctp/issues/12 had happened (not just in Pion but across all implementations) this could have been fixed years ago. The hardcoding we all settle for is tragic.
Pair it with the anti-solution of dropping large packets instead of truncating them and we get our perfect storm of bad design that is MTU incompatibility and modern MTU discovery.
And then our authentication stopped working on simulated iOS devices (while still working on the real devices!). After hours of frantic debugging and staring at Wireshark dumps, I found the issue: HTTP3 and QUIC. Apparently, the simulated stack was not tracking the MTU correctly and was trying to send 1506-byte UDP packets.
The "fix" was to add deny rules for UDP ports 80/443 to our firewall.
This started as a blank page on one device and ended two weeks later at the intersection of two bugs: webrtc-rs hardcodes INITIAL_MTU=1228 [never updated, no path probing, retransmits at the same size forever], and Tailscale's packet filter classifies any IPv6 packet with a Fragment header as unknown protocol, so the default deny fires. On every platform, counted under reason="acl". Neither is unreasonable alone. Together: silent wedge, every health check green, because everything that tests the path is small and only the payload fragments. Two-command repro on any tailnet: ping -s 100 works, ping -s 1400 over the Tailscale IPv6 address is 100% loss. Full WebRTC repro and captures: https://github.com/phact/mtu-webrtc-bug. We've reported upstream to both projects https://github.com/tailscale/tailscale/issues/20083 and https://github.com/webrtc-rs/webrtc/issues/806. Happy to answer questions. Especially interested if anyone knows the history behind the IPv6 fragment decision in Tailscale's filter.
Agreed. The port-number point is the most plausible rationale I've heard, more convincing than the RFC line in their source comment. The historical fix for "can't classify fragments" was virtual reassembly or flow tracking [conntrack on Linux, scrub in pf], so dropping them outright skips past known prior approaches. Even your lighter idea would have saved us: a first-fragment match would have let our pair through.
We've reported upstream to both projects, tailscale/tailscale#20083 and webrtc-rs/webrtc#806, and webrtc-rs already invited a PR.
I’d venture to guess based on this outcome that fragmented UDP over IPv6 isn’t really an ordinary occurrence. Given the preponderance of HTTPS traffic, the aversion to fragmentation in IPv6, and the weird corner case of there being a hardcoded packet size in webrtc, it’s reasonable to assume that this is a corner case.
A good one to be aware of, but not common.
"The hardcoding we all settle for" might be the epigraph for the whole incident. webrtc-rs invited a PR for the configurable-MTU + better default half [webrtc-rs/webrtc#806] to unblock folks today. Whether PMTUD gets implemented will be interesting to see.
Last I checked, all browsers silently fail if it's too big.
I added this in Pion here[0] and I remember testing against Chrome + Firefox and it seemed to work great!
[0] https://github.com/pion/webrtc/commit/e4ff415b2bff31382bdb80...
Welcome to networking mistakes, I guess. I can't remember the specifics but I once encountered a router that would drop traffic that looked like encapsulated TCP at a certain offset, or something like that. They couldn't fix it because the behavior was hardwired. I knew of it because I worked with the firmware team.
Factorio discovered that UDP packets with a checksum of 0x0000 get dropped by some devices.
If you're not familiar with how p2claw works, it's worth checking out the "how it works" blog post before diving into this one.
I opened one of my p2claw apps on my iPad and got a blank page. The same URL was working on my Mac, my Linux box and my phone. On the same wifi, same browser engine, same network.
Like in a good detective story, we came up with a bunch of suspects [the iPad, then WebKit, then Tailscale] and they all turned out to be innocent. Sort of. It turned out to be two bugs wearing a trenchcoat: a hardcoded constant in webrtc-rs, and a one-line design decision in Tailscale that we found through sheer stubbornness. We had a workaround patched the same day, but understanding what we had actually patched took two more weeks.
The app loaded enough HTML to paint the loading state and then hung. There were no relevant console errors, the Service Worker registered, the WebRTC handshake finished, the data channel opened [dc.readyState === "open"], and then nothing. The browser sent its first GET / over the data channel and waited forever for the response.
The box agent on the other end thought everything was fine. It had served the response and pushed the bytes onto the channel. They just never made it to the iPad.
If that wasn't tricky enough, it was a heisenbug: if I refreshed like crazy, the page would sometimes load.
The first useful thing we did was log both ends of the connection and line the logs up by clock time: every chunk the box sent, every chunk the browser received, and, crucially, how much data the box was holding in its outbound buffer waiting to be confirmed delivered. That helped us figure out where the data was not making it to the other end.
After discarding everything up to and including the WebRTC handshake, we were grasping at straws. We checked some WebRTC-specific limits and double-checked network stability.
We read the maximum message size (maxMessageSize) off both devices. The iPad reported 64kb, exactly the same as the Mac, and far above the 7-8kb chunks we were sending. After this, we felt like we had ruled out message chunk size as a culprit, which ended up making the true diagnosis harder to arrive at. It had to be something specific to the iPad, but we had no idea what.
Per request, the box sent three chunks: a 220 byte header, a 7,874 byte body, and a 199 byte tail. Our new instrumentation showed the sender's outbound buffer climb to about 8kb and stop. It was holding the body it had "sent" but could never get confirmation it had arrived. When the iPad refreshed, we saw the identical pattern.
WebRTC data channels guarantee in-order delivery on top of lossy UDP, so one missing chunk blocks subsequent messages. On the iPad, in the browser's js console, we saw exactly one chunk being received [the 220 byte header] and then nothing. We didn't see the body or the small headers of the following requests.
We tested on Safari on the Mac, guessing the issue might be WebKit since it happened on every iOS browser [and all iOS browsers are WebKit under the hood], but the Mac was receiving 8kb and 11kb chunks without a hiccup.
After two hours of WebKit theories, I realized that, unlike the Mac, the iPad had Tailscale enabled.
Tailscale is a VPN, and a VPN wraps your traffic in an extra layer that leaves less room in each packet. So the big responses got sliced into more, smaller pieces on the way to the iPad than they did to the Mac. WebKit implements data channels itself, in userspace, including reassembling big messages from the packets that carry them. Our theory evolved toward a bug in WebKit's message reassembly.
We capped the box's messages at 800 bytes, small enough that each one rode a single packet, and the iPad loaded instantly, Tailscale on or off. It felt like case closed [actually a first attempt at 1,200 bytes, which Claude helped me calculate should fit, mysteriously didn't work. Hold that thought].
In hindsight, we had just discovered that the issue was the VPN, and yet we stuck to our WebKit theory. Given our context bloat [both mine and the agents', this is troubleshooting in the age of AI after all], the Tailscale discovery got absorbed into the WebKit theory instead of challenging it. We could have looked at the network and the WebRTC sender, but instead we took it as one more reason the browser was at fault. So we wrote the incident up as an iOS Safari bug [the device gets the packets but never reassembles them for the app] and started building a standalone reproduction to prove it.
For the next two weeks, the bug didn't repro with a JavaScript sender, so we turned to a webrtc-rs based Rust sender. Still nothing. We matched the data channel chunk shapes and sizes, and used a real browser receiver both on Linux and on the iPad, with and without Tailscale. It delivered everything, every time. Eventually we had to re-read our own evidence [actually Anthropic released Fable and I had it dig up the jsonl logs from the original debugging session].
The decisive numbers were in WebRTC's own getStats() counters, which our client logs to the console and which we'd captured in photos of the screen during the incident. The iPad's candidate pair froze at 2,144 bytes received across 18 packets, while the data channel had delivered exactly one message [266 bytes, our 220 byte header plus framing]. The box was retransmitting the big packet the whole time. If Safari were getting those packets and merely failing to stitch the message back together, the transport counter should have climbed by another kilobyte-plus with every retransmission while the message stalled. It never moved. The packets were not arriving at all.

Actual photo from the night of the incident. Every number that mattered is in frame, but it took us two weeks to understand them.
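The counter argument above can be written down as a small decision rule. This is only a sketch: the Snapshot fields and the diagnose helper are hypothetical names, not the exact getStats() keys, and the 1,265 byte threshold is the retransmitted payload size that shows up later in this post.

```rust
/// Hypothetical snapshot of two receiver-side counters. The field names are
/// illustrative, not the exact getStats() keys.
#[derive(Debug, Clone, Copy)]
struct Snapshot {
    transport_bytes_received: u64,
    messages_delivered: u64,
}

#[derive(Debug, PartialEq)]
enum Diagnosis {
    /// New messages reached the application.
    Delivering,
    /// Bytes keep arriving but no message completes: a receiver-side problem.
    ArrivingButStuck,
    /// Neither counter moved enough: the retransmissions are not arriving.
    NotArriving,
}

/// Compare two snapshots taken across at least one sender retransmission.
/// `retransmit_bytes` is the size of the packet the sender keeps resending.
fn diagnose(before: Snapshot, after: Snapshot, retransmit_bytes: u64) -> Diagnosis {
    let new_bytes = after
        .transport_bytes_received
        .saturating_sub(before.transport_bytes_received);
    if after.messages_delivered > before.messages_delivered {
        Diagnosis::Delivering
    } else if new_bytes >= retransmit_bytes {
        Diagnosis::ArrivingButStuck
    } else {
        Diagnosis::NotArriving
    }
}

fn main() {
    // The frozen counters from the incident: 2,144 bytes, one message.
    let frozen = Snapshot { transport_bytes_received: 2144, messages_delivered: 1 };
    println!("{:?}", diagnose(frozen, frozen, 1265));
}
```

Our Safari theory predicted ArrivingButStuck. The frozen counters say NotArriving.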
So we stopped trying to reproduce a browser bug and reproduced the network instead.
webrtc-rs, the Rust WebRTC stack our box uses, cuts its outgoing data-channel messages into packets sized against this:
// sctp/src/association/mod.rs
pub(crate) const INITIAL_MTU: u32 = 1228;
It's not configurable and nothing ever updates it. The 1,228 byte packet plus the encryption layer that wraps it comes out to 1,265 bytes on the wire. Add the 28 bytes of UDP and IPv4 headers, or 48 for IPv6, and that's a 1,293 byte packet over IPv4, or 1,313 bytes over IPv6. Tailscale's tunnel carries at most 1,280.
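Here is that arithmetic as a minimal Rust sketch. The 37 byte encryption overhead is not a documented constant; it is just the difference between the 1,265 and 1,228 figures above. Every name except INITIAL_MTU is made up for illustration.

```rust
const INITIAL_MTU: usize = 1228; // the webrtc-rs SCTP constant quoted above
const ENCRYPTION_OVERHEAD: usize = 1265 - 1228; // 37 bytes, inferred from the observed sizes
const UDP_HEADER: usize = 8;
const IPV4_HEADER: usize = 20; // without options
const IPV6_HEADER: usize = 40; // fixed header only
const TUNNEL_MTU: usize = 1280; // what the Tailscale tunnel carries

/// Size of the full IP packet for a given IP header length.
fn ip_packet_size(ip_header: usize) -> usize {
    INITIAL_MTU + ENCRYPTION_OVERHEAD + UDP_HEADER + ip_header
}

fn main() {
    for (name, header) in [("IPv4", IPV4_HEADER), ("IPv6", IPV6_HEADER)] {
        let size = ip_packet_size(header);
        let excess = size.saturating_sub(TUNNEL_MTU);
        println!("{}: {} bytes, {} over the tunnel limit", name, size, excess);
    }
}
```

Under these assumptions the packet overshoots the tunnel by 13 bytes on IPv4 and 33 bytes on IPv6, so it has to be fragmented in both cases.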
It turns out that the packet being too big is not fatal by itself. When the kernel routes a large packet into the tunnel, it does the polite thing the IP layer has done since the eighties: it fragments. It sends two pieces over the wire, each under the limit, and they get reassembled on the other side. We confirmed this with tcpdump. The fragments leave the box. On a healthy path everything arrives and the bug is invisible, which is exactly why our standalone repro kept passing.
We were back to the drawing board. In the repro, the packets fragmented and reassembled neatly; in the incident, the iPad froze. So the question wasn't why the packet was too big. It was: where did the fragments go?
To answer that, we went back to the real thing. We cranked the box agent's chunk cap back up to 8kb, served a real app through it, and loaded it on the iPad over Tailscale while capturing on the tunnel interface.
It wedged on cue, and this time we were watching both layers at once. The agent's outbound buffer froze at 13kb [not 8kb because different app, different payload]. On the wire, the same 1,265 byte payload left as two IPv6 fragments and got retransmitted on SCTP's textbook backoff schedule: +1.2s, +2s, +4s, +8s. Identical fragments every time, never acknowledged. And the whole time, small packets kept flowing in both directions like nothing was wrong. Heartbeats, acks for old data, connectivity checks, all fine. The connection looked perfectly healthy except for the actual data payloads.
Then a Linux laptop on the same tailnet loaded the same app through the same tunnel just fine. Which gave us the experiment that cracked the whole thing open.
If fragments were dying somewhere on the iPad's path, we didn't need WebRTC to prove it. We tried ping.
A 1,400 byte ping forces fragmentation through a 1,280 byte tunnel. A 100 byte ping doesn't. Run both, over both address families, and you get a truth table:
ping -s 100  <ipad over IPv4>    3/3 received
ping -s 1400 <ipad over IPv4>    3/3 received      fragments fine
ping -s 100  <ipad over IPv6>    3/3 received
ping -s 1400 <ipad over IPv6>    0/3, 100% loss    fragments gone
IPv4 fragments reassemble. IPv6 fragments vanish. Deterministically, every run.
It wasn't an iOS thing: every Tailscale device we pointed this at exhibits the same packet loss. There is something in Tailscale itself, on every platform, that eats IPv6 fragments.
Tailscale's client keeps diagnostic counters, and on the receiving machine one of them increments when we ping -s 1400 over IPv6. The output of tailscale metrics print includes:
tailscaled_inbound_dropped_packets_total{reason="acl"} 6
Three pings, two fragments each, six drops. The arithmetic matched on every machine we checked. The kernel's own IPv6 reassembly counters stayed at zero the whole time; the fragments were being dropped before the operating system ever saw them.
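A rough check of the "two fragments each" figure, as a sketch. It assumes ping -s sets the ICMP payload size with an 8 byte ICMP header on top, a 20 byte IPv4 header without options, and a 40 byte IPv6 header plus an 8 byte Fragment header on every IPv6 fragment. fragment_count is a hypothetical helper, not something taken from Tailscale or the kernel.

```rust
/// Number of IP packets needed to carry `upper_len` bytes of upper-layer data
/// (for ping: ICMP header + payload) over a link with the given MTU.
/// `ip_header` is the header size of the unfragmented packet;
/// `frag_overhead` is the header cost carried by each fragment.
fn fragment_count(upper_len: usize, ip_header: usize, frag_overhead: usize, mtu: usize) -> usize {
    if ip_header + upper_len <= mtu {
        return 1; // fits without fragmentation
    }
    // Every fragment except the last must carry a multiple of 8 data bytes.
    let per_fragment = (mtu - frag_overhead) / 8 * 8;
    (upper_len + per_fragment - 1) / per_fragment // ceiling division
}

fn main() {
    let mtu = 1280;
    let icmp_header = 8;
    for payload in [100, 1400] {
        let upper = icmp_header + payload;
        let v4 = fragment_count(upper, 20, 20, mtu);
        let v6 = fragment_count(upper, 40, 40 + 8, mtu);
        println!("ping -s {}: {} packet(s) over IPv4, {} over IPv6", payload, v4, v6);
    }
    // Three large pings over IPv6.
    println!("drops expected for 3 pings: {}", 3 * fragment_count(8 + 1400, 40, 48, mtu));
}
```

With those assumptions a 1,400 byte ping becomes two fragments on either address family, and three of them account for the six drops on the counter.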
reason="acl" means the packet filter dropped them as a policy denial. Which is a strange thing to see on a personal tailnet whose access policy is allow everything. So we went to GitHub to have a look at the source [Tailscale's client is open source, which made this whole hunt possible]. There we learned that their IPv6 parser doesn't parse fragments. Any packet carrying an IPv6 Fragment header gets classified as "unknown protocol," and an unknown-protocol packet can't match any allow rule, so the default deny fires. The comment in the code reads:
Note that this means we don't support fragmentation in IPv6. This is fine, because IPv6 strongly mandates that you should not fragment.
It's a reasonable-sounding line, and I think it's a misreading. IPv6 forbids routers from fragmenting packets in flight. It fully allows the sender to fragment, and the spec requires the receiving end to put the pieces back together. Our sending kernel was following the rules. Tailscale's filter drops what the kernel produced, by design, silently, and files it under "acl." IPv4 fragments, for what it's worth, get proper handling and sail through, which contributed to our heisenbug.
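To make the classify-then-default-deny interaction concrete, here is a toy filter. It is not Tailscale's code, and every name in it is invented for this post. It only assumes the standard IPv6 layout, where the Next Header byte sits at offset 6 of the fixed 40 byte header and the value 44 marks a Fragment extension header.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Proto {
    Tcp,
    Udp,
    Icmpv6,
    Unknown,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Accept,
    Drop,
}

/// Classify by the Next Header byte of the fixed IPv6 header.
fn classify(packet: &[u8]) -> Proto {
    if packet.len() < 40 {
        return Proto::Unknown;
    }
    match packet[6] {
        6 => Proto::Tcp,
        17 => Proto::Udp,
        58 => Proto::Icmpv6,
        // 44 (Fragment) lands here: this toy parser never looks past it.
        _ => Proto::Unknown,
    }
}

/// Allow rules can only name known protocols, so Unknown always falls
/// through to the default deny, even when the policy allows everything.
fn filter(packet: &[u8], allowed: &[Proto]) -> Verdict {
    let proto = classify(packet);
    if proto != Proto::Unknown && allowed.contains(&proto) {
        Verdict::Accept
    } else {
        Verdict::Drop
    }
}

/// Build a bare 40 byte IPv6 header with the given Next Header value.
fn ipv6_header(next_header: u8) -> [u8; 40] {
    let mut header = [0u8; 40];
    header[0] = 0x60; // version 6
    header[6] = next_header;
    header
}

fn main() {
    let allow_everything = [Proto::Tcp, Proto::Udp, Proto::Icmpv6];
    println!("plain UDP:  {:?}", filter(&ipv6_header(17), &allow_everything));
    println!("fragmented: {:?}", filter(&ipv6_header(44), &allow_everything));
}
```

The same UDP datagram is accepted when it travels whole and dropped once it is wrapped in a Fragment header, and no allow rule can express "let fragments through".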
HTTPS traffic negotiates its packet size up front [TCP MSS clamping] so it never oversends. The kind of traffic WebRTC uses has no such negotiation, and almost nothing else sends large UDP over v6 without its own MTU handling, so the trap sits unsprung until something like webrtc-rs walks into it.
The two-command version needs nothing but a tailnet with two devices:
ping -s 100  <any tailscale IPv6 address>    works
ping -s 1400 <any tailscale IPv6 address>    100% loss
The full WebRTC version is at github.com/phact/mtu-webrtc-bug: a tiny relay that drops oversized packets [the deterministic stand-in for the fragment-eating path], plus captures from the real tunnel and a writeup of the localization, diagnostics/who-loses-the-packets.md. Next, we're reporting the constant to the webrtc-rs maintainers with a suggested fix, and we're filing an issue for the fragment drop in the tailscale repo.
Packets that are too big for the path, silently vanishing, with nobody told why, is one of the internet's oldest problems. It never got solved so much as papered over, and it resurfaces whenever new software sends its own packets without checking what the path accepts. If you're building anything like that [video calls, games, peer-to-peer anything], assume a real fraction of your users are on a path smaller than you'd expect, and either keep your packets conservatively small or probe before you trust.
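For the "probe before you trust" option, a minimal sketch of the search is below, loosely in the spirit of datagram path MTU probing. The probe closure is a stand-in for whatever your transport can do to test a size [send a padded packet and wait for an acknowledgement, for example]. The function name, the bounds and the monotonicity assumption are mine, not taken from any of the projects discussed here.

```rust
/// Largest size in `floor..=ceiling` for which `probe` succeeds, assuming that
/// whenever a size gets through, every smaller size gets through as well.
/// Returns None if even `floor` fails.
fn largest_working_size<F>(floor: usize, ceiling: usize, mut probe: F) -> Option<usize>
where
    F: FnMut(usize) -> bool,
{
    if floor > ceiling || !probe(floor) {
        return None;
    }
    let (mut lo, mut hi) = (floor, ceiling);
    while lo < hi {
        // Round up so that `lo = mid` always makes progress.
        let mid = lo + (hi - lo + 1) / 2;
        if probe(mid) {
            lo = mid; // mid got through, try larger
        } else {
            hi = mid - 1; // mid was lost, try smaller
        }
    }
    Some(lo)
}

fn main() {
    // Simulated path that silently drops anything above 1,280 bytes.
    let path_limit = 1280;
    let found = largest_working_size(576, 1500, |size| size <= path_limit);
    println!("usable packet size: {:?}", found);
}
```

A real implementation also has to treat a lost probe as ambiguous [congestion loses packets too], retry before concluding anything, and re-probe when the path changes. The sketch skips all of that.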
Neither project here did anything crazy. webrtc-rs picked a constant [which by the way is only 28 bytes more optimistic than Chrome's] and trusted the network to cope. Tailscale decided IPv6 fragments weren't worth supporting and trusted that nothing legitimate sends them. Both decisions are defensible in isolation. Together they form a trap with no error message, where the only symptom is a blank page on one specific device, and one that is liable to cost you a week or two.
Part of me wants to say we only hit this because p2claw uses things in weird ways. Which is true, and is also the whole point of p2claw. p2claw exists so that agents can self-host with no signup, so vibe coders can deploy with OAuth with a single CLI call, so web apps can be peer to peer. To do that we bypass a bunch of machinery most software relies on to participate in the internet. This is what programming is all about. Bending the system and the standards to your will. The more you bend, the wackier the bugs.
Two debugging lessons I'm keeping. First: a sender-side packet capture only proves the packets left. We "verified" fragments flowing with tcpdump on the box and called the path healthy; the fragments were leaving beautifully and dying on arrival, every time. Watch the receiver [easier said than done when the receiver is a non-jailbroken iPad, but the point holds]. Second: when a bug only shows up on one device, before you blame the device, ask what path only that device takes.
The iPad was fine. The iPad was just on Tailscale. And Tailscale was just doing what the comment says it does.
UPDATE: Both issues are filed. The webrtc-rs constant is webrtc-rs/webrtc#806 and the IPv6 fragment drop is tailscale/tailscale#20083.