Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
If you want real time, that's what you are going to deal with. If you don't want real time, and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't even be sending audio on the wire at all.
Most of the glitches I heard with OpenAI's Voice were not WebRTC related - but rather, to my ear, they sounded more like realtime issues with their inference - which is a very different component to optimize.
This blog was super insightful for me in understanding the root problems in the current implementation though.
IMO, tech standards should be simple and minimal and people should be able to implement whatever they want on top. I tend to stay away from complex web standards.
Had a nice chuckle.
Tangential, but by being that, it's also refreshingly human writing, vs the both-sidesy bullet listed AI pablum that's all around us these days.
I have zero take on the subject matter, but I like that the article had a detectably human flair.
And if it was AI written, god help us.
The answer came back over the same connection.
In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP2 from the phone and both iOS and Android will pretty much take care of that connection magically.
The author is absolutely right, a real time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms. Especially in the age of mobile phones, where most people are used to their real-time human-to-human communications having a delay.
(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
This is the opposite of the feedback I get. Users want instant responses. If you have delay in generating responses/interruptions it kills the magic. You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)
> TTS is faster than real-time
https://research.nvidia.com/labs/adlr/personaplex/ The latest/aspirational voice AI is moving away from what the author describes. Audio is trickled in/out in 20ms chunks.
> We really hope the user’s source IP/port never changes, because we broke that functionality.
That is supported. When a new IP comes in for an existing ufrag, it's handled.
> It takes a minimum of 8* round trips (RTT)
That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/
> I’d just stream audio over WebSockets
You lose stuff like AEC. You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with Realtime API + web sockets (lots of code and having to do stuff by hand)
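For reference, roughly what that onboarding path looks like in the browser; a minimal sketch, where the /session endpoint is just a stand-in for whatever signaling exchange the service actually exposes:

```ts
// Minimal sketch: the "offer -> answer" flow from the app's point of view.
// The /session endpoint is hypothetical; WHIP and the Realtime API differ in details.
async function connectVoice(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Capture the microphone; AEC, noise suppression, etc. come along for free.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // Play whatever audio track the server sends back.
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // The whole "protocol" the developer sees: createOffer -> POST -> setRemoteDescription.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const resp = await fetch("/session", {
    method: "POST",
    headers: { "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });

  return pc;
}
```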
----
I think if I had my choice I would pick the Offer/Answer model and then do QUIC instead of DTLS+SCTP. Maybe do RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.
Having just had to tackle this again for my own startup, I'm reminded about what you would lose by ditching WebRTC - the audio DSP pipeline, transmit side VAD, echo cancellation, noise suppression, NAT traversal maturity, codec integration, browser ubiquity etc.
webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result
I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat
> and then a GPU pretends to talk to you via text-to-speech
OpenAI is speech-to-speech, there is no TTS in voice mode
> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection
signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further
ultimately though, it comes down to
> It’s not like LLMs are particularly responsive anyway
I expect to see a shift in how S2S models work to be lower latency like the new voice API models that OpenAI announced
to be fair, the new models were released the day after this MoQ blog was published
I think that they've done more than enough of 'trying the normal way' to be warranted in having an opinion the other way, don't you think?

It's 2026 and teleconferencing is still such a shit show. There's billions of dollars to be had and Zoom is at best mediocre, and it can be as bad as Microsoft Whatchamacallit. I've never not seen teleconferencing be a ham-handed mess.
I hope it's getting better with education/more libraries. It's also amazing how easily Codex etc. can burn through it now.
Isn’t the point that OpenAI’s use case does not require realtime?
When OpenAI responds, it has most of the audio in advance of when the user needs to hear it. It produces audio faster than real time, so a real time protocol is a bad fit.
Every low-latency application has to decide the user experience trade-off between quality and latency. Congestion causes queuing (aka latency) and to avoid that, something needs to be skipped (lower quality).
The WebRTC latency vs. quality knob is fixed. It's great at minimizing latency, but suffers from a lack of flexibility. We still (try to) use WebRTC anyway, because like you implied, browser support has made it one of the only options.
Until now of course! WebTransport means you can achieve WebRTC-like behavior via a generic protocol. Choose how long you want to wait before dropping/resetting a stream, instead of that decision being made for you.
And yeah my point in the blog is that often the user wants streaming, but not dropping. Obviously you can stream audio input/output without WebRTC. The application should be able to decide when audio packets are lost forever... is it 50ms or 500ms or 5000ms? My argument is that voice AI shouldn't pick the 50ms option.
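To sketch what I mean (purely illustrative; the URL and the 500ms budget are made up), WebTransport lets the application own that decision per chunk:

```ts
// Each audio chunk gets its own unidirectional stream. If it hasn't finished
// sending within the app-chosen deadline, reset the stream so QUIC stops
// retransmitting it; later chunks are unaffected (no head-of-line blocking).
async function sendChunk(wt: WebTransport, chunk: Uint8Array, deadlineMs: number) {
  const stream = await wt.createUnidirectionalStream();
  const writer = stream.getWriter();

  const timer = setTimeout(() => {
    writer.abort("deadline exceeded").catch(() => {});
  }, deadlineMs);

  try {
    await writer.write(chunk);
    await writer.close();
  } catch {
    // Chunk dropped on purpose: the app decided it was too stale to matter.
  } finally {
    clearTimeout(timer);
  }
}

// Usage sketch: pick 50, 500, or 5000 ms; it's the app's call, not the protocol's.
// const wt = new WebTransport("https://example.com/audio");
// await wt.ready;
// await sendChunk(wt, pcmChunk, 500);
```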
“Hell no”
> “Umm…”
You only need to send ~1 second at a time. There's no reason to send 20ms or 10 min at a time. Both are stupid.
WebRTC is complex, even if it's a library (even if it's a library built into the browser they're already using). For a client/server voice interaction, I don't see why you would willingly use it. Ship voice samples over something else; maybe borrow some jitter buffer logic for playback.
My job currently involves voice and video conferencing and 1:1 calls, and WebRTC is so much complexity... it got our product going quickly, but when it does unreasonable things, it's a challenge to fix it; even though we fork it for our clients.
I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.
[1] TURN should allocate a rendezvous id rather than an ephemeral port when the TURN client requests an allocation. Then their peer would connect to the TURN server on the service port and request a connection to the rendezvous id, without needing the client to know the peer address and add a permission. It would require less communication to get to an end-to-end relayed connection. Advanced clusters could encode stuff in the id so the client and peer could each contact a TURN server local to them and the servers could hook things up; less advanced clusters would need to share the TURN server IP and service port(s) with the id.
You run into issues around AudioContext and resumption etc... it's a PITA to have to handle all those corner cases :(
Did they really say they prefer a fast response over an accurate response?
Cloudflare doesn't support WebTransport well.
> This is the opposite of the feedback I get. Users want instant responses.
I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.
Deeply skeptical!
Not my experience, running around 6,000 conversations per day with voice, with webrtc + cascading (stt/llm/tts) architecture.
Maybe I misunderstood your comment, but that 500ms is basically the floor of a state-of-the-art voice implementation these days - if you are lucky, don't skimp, and do various expensive things like speculative decoding and reasoning. 450ms on the LLM pass alone. Every ms counts in commercial applications of voice AI. If you add 200ms or 300ms to that, it really degrades the conversation.
We do a lot of voice stuff to support our business, largely with unsophisticated, non technical users. Last year's attempts, with measured turn to turn latencies of around 1200ms-1500ms, led to a lot of user confusion, interruptions, abandoned conversations and generally very unpleasant experiences. We are at around 700ms turn to turn now, depending on tool usage needed, and its approaching an OK experience, rivalling an interaction with an actual human. We are spending quite a lot to shave another 100ms off that. We do expensive, wasteful things such as speculative LLM passes, we do speculative tool executions (do a few LLM inferences as the user speaks, but don't actually execute non-idempotent tool calls before you know that that LLM pass is usable and the user did not say anything important at the tail end of their sentence) just to shave 100-200ms. When someone says 500ms is irrelevant I am sure they are describing some other use case, not human-to-AI voice interactions.
In my experience with voice AI, the problem is not with some occasional dropped WebRTC packets. The real hard problem is with strong background noises, echo, and of course accents. WebRTC with its polished AEC implementations helps quite a lot, at least with echo. I get that the protocol is a major PITA to implement at OpenAI scale, but for anything but hyperscale applications there are lots of good, viable solutions and commercial providers (say, Daily for instance) that make it a non-problem. The real problems to solve are still elsewhere. But boy, add 500ms to my latency budget and you've killed my application.
1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency vs quality, but WebRTC makes that knob difficult to turn.
2. I'm obviously not a TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?
3. Yeah, sometimes the client is aware when their IP changes and can do an ICE renegotiation. But often they aren't aware, and normally would rely on the server detecting the change, but that's not possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.
4. Okay, that draft means 7 RTTs instead of 8 RTTs? Again some can be pipelined so the real number is a bit lower. But like the real issue is the mandatory signaling server which causes a double TLS handshake just in case P2P is being used.
5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.
I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.
The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.
I disagree with this SO strongly. I find the conversational voice mode to be a game changer because you can actually have an almost normal conversation with it. I'd be thrilled if they could shave off another 50-100ms of latency, and I might stop using it if they added 200ms. If I want deep research I'll use text and carefully compose my prompt; when I'm out and about I want to have a conversation with the Star Trek computer.
Interestingly I'm involved with a related effort at a different tech company and when I voiced this opinion it was clear that there was plenty of disagreement. This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.
Which results in the interesting situation where the transcript isn't what was said:
Q: Why do the voice transcripts sometimes not match the conversation I had?
A: Voice conversations are inherently multimodal, allowing for direct audio exchange between you and the model. As a result, when this audio is transcribed, the transcription might not always align perfectly with the original conversation.
[0] https://developers.openai.com/api/docs/models/gpt-realtime-t...
[0] https://developer.mozilla.org/en-US/docs/Web/API/RTCRtpRecei...
Can you repeat that please? It didn't make any sense. This conversation doesn't feel "real".
What I was saying is the same as you -- the user will tolerate a total delay of 500ms, and then happiness starts to fall off. We had some Alexa utterances at 500ms, the most basic ones, but most took longer.
However, even with http2 and the like, we could get in that range because of the fact that it was sending data right away, so we were mostly done processing the STT by the time they were done speaking, and we were already working on the answer based on the first part of the utterance.
But I would need to see some really strong evidence to even think about using WebRTC.
But you’re not. And you won’t. You’ll never have a conversation with the Star Trek computer while you continue to place anything else above accuracy. Every time I see someone comparing LLMs to the Star Trek computers, it seems to be someone who doesn’t understand that correctness was their most important feature. I’m starting to get the feeling people making that comparison never actually watched or understood Star Trek.
A computer which gives you constant bullshit is something only the lowest of the Ferengi would try to sell.
> This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.
It’s not. It absolutely is not and will never be. Not unless all you’re looking for is affirmation, companionship, titillation. I suggest looking for that outside chat bots.
People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.
But pauses and stalls are much more damaging. A sudden freeze in the middle of speech breaks turn-taking, timing, and attention. It feels like the speaker stopped thinking, the connection died, or the system got stuck.
For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.
I prompt orchestrations most of the day, and am very particular about the fidelity of my context stack.
Yet I’ve used advanced voice mode on ChatGPT via the iOS app a lot. And I have not had a problem with it understanding my requests or my side of the conversation.
I have looked at the dictation of my side and seen it has blatant mistakes, but I think the models have overcome that the same way they do conference audio stt transcripts.
I have had times where the ~sandbox of those conversations, and their far more limited ability to build a useful corpus of context via web searches or by accessing prior conversation content, got in the way.
The biggest problem I have had with adv voice was when I accidentally set the personality to some kind of non emotional setting. (The current config seems much more nuanced)
The AI who normally speaks with relative warmth and easy going nature turned into an emotionless and detached entity.
It was unable to explain why it was acting this way. I suspect the low latency did a disservice there, because paired with something adversarial it was deeply troubling.
Sure, but I am skeptical that users are actually saying "I prefer wrong answers over lag", which is what the post I responded to implied.
This is different to user's saying "I prefer quick answers to laggy answers", which is what I presume they may have said.
To actually settle this, the feedback must answer the question "Do you want wrong answers quickly or correct answers with an added 0.2 second delay?" because, well, those are the only two options right now.
published 5/6/2026
OpenAI posted a technical blog a few days ago. This blog post triggered me more than it should have. I felt the urge to slap my meaty fingers on the keyboard.
You should NOT copy OpenAI.
I don’t think you should use WebRTC for voice AI. WebRTC is the problem.
Like 6 years ago I wrote a WebRTC SFU at Twitch. Originally we used Pion (Go) just like OpenAI, but forked after benchmarking revealed that it was too slow. I ended up rewriting every protocol, because of course I did!
Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust. Because of course I did! You’re probably noticing a trend.
Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s. And some de-facto standards that are technically drafts (ex. TWCC, REMB). Not a fun fact when you have to implement them all.
You should consider me a Certified WebRTC Expert. Which is why I never, never want to use WebRTC again.
I’m going to cheat a little bit and start with the hot takes before they get cold. Don’t worry, we’ll get right back to talking about the OpenAI blog post and load balancing, I promise.
WebRTC is a poor fit for Voice AI.
But that seems counter-intuitive? WebRTC is for conferencing, and that involves speaking? And robots can speak, right?
Let’s say I pull up my OpenAI app on my phone. I say hi to ~~Scarlett Johansson~~ Sky and then I utter:
should I walk or drive to the car wash?
WebRTC is designed to degrade and drop my prompt during poor network conditions.
wtf my dude
WebRTC aggressively drops audio packets to keep latency low. If you’ve ever heard distorted audio on a conference call, that’s WebRTC baybee. The idea is that conference calls depend on rapid back-and-forth, so pausing to wait for audio is unacceptable.
…but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate. After all, I’m paying good money to boil the ocean, and a garbage prompt means a garbage response. It’s not like LLMs are particularly responsive anyway.
But I’m not allowed to wait. It’s impossible to even retransmit a WebRTC audio packet within a browser; we tried at Discord. The implementation is hard-coded for real-time latency or else.
UPDATE: Some WebRTC folks are claiming this is a skill issue. It might be possible to enable audio NACKs, but we couldn’t figure out the correct SDP munging. Either way, the WebRTC jitter buffer is aggressively small.
And yes, Voice AI agents will eventually get the latency down to the conversational range. But reducing latency has trade-offs. I’m not even sure that purposely degrading audio prompts will ever be worth it.

Two roads diverged in a yellow wood. And sorry I could not travel both. And be one traveler, long I stood. And looked down one as far as I could. Until I ran out of tokens.
You speak into the microphone, it gets sent to one of OpenAI’s billion servers, and then a GPU pretends to talk to you via text-to-speech. Neato.
Let’s say it takes 2s of GPUs to generate 8s of audio. In an ideal world, we would stream the audio as it’s being generated (over 2s) and the client would start playing it back (over 8s). That way, if there’s a network blip, some audio is buffered locally. The user might not even notice the network blip.
But nope, WebRTC has no buffering and renders based on arrival time. Like seriously, timestamps are just suggestions. It’s even more annoying when video enters the picture.
To compensate for this, OpenAI has to make sure packets arrive exactly when they should be rendered. They need to add a sleep in front of every audio packet before sending it. But if there’s network congestion, oops we lost that audio packet and it’ll never be retransmitted.
OpenAI is literally introducing artificial latency, and then aggressively dropping packets to “keep latency low”. It’s the equivalent of screen sharing a YouTube video instead of buffering it. The quality will be degraded.
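For contrast, here's a minimal sketch of what client-side buffering could look like in the browser, assuming the server just pushes encoded audio chunks (say, over a WebSocket) as fast as it generates them:

```ts
// Each chunk is decoded and scheduled right after the previous one, so audio
// that arrives faster than real time simply grows the local buffer and rides
// out network blips. The 50ms priming delay is illustrative.
const ctx = new AudioContext();
let playhead = 0; // AudioContext time at which the next chunk should start

async function onAudioChunk(encoded: ArrayBuffer): Promise<void> {
  const buffer = await ctx.decodeAudioData(encoded);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Never schedule in the past; otherwise append to the end of the queue.
  playhead = Math.max(playhead, ctx.currentTime + 0.05);
  source.start(playhead);
  playhead += buffer.duration;
}
```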

Thanks to Mr. Robot friend for the unambiguous advice
Fun fact: WebRTC actually adds latency. It’s not much, but WebRTC has a dynamic jitter buffer that can be sized anywhere from 20ms to 200ms (for audio). This is meant to smooth out network jitter, but none of this is needed if you transfer faster than real-time.
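For the curious, a jitter buffer is conceptually tiny; here's a rough sketch of a fixed-delay one (everything here is illustrative, WebRTC's actual implementation is far more sophisticated):

```ts
// Hold packets for `delayMs`, play them out in sequence-number order at a
// fixed frame rate, and skip anything that still hasn't arrived by its slot.
interface AudioPacket {
  seq: number;
  payload: Uint8Array;
}

class JitterBuffer {
  private packets = new Map<number, AudioPacket>();
  private nextSeq: number | null = null;

  constructor(private delayMs: number, private frameMs = 20) {}

  push(packet: AudioPacket): void {
    this.packets.set(packet.seq, packet);
    if (this.nextSeq === null) {
      this.nextSeq = packet.seq;
      // Wait `delayMs` before the first pop so late/reordered packets can land.
      setTimeout(() => this.pop(), this.delayMs);
    }
  }

  private pop(): void {
    const seq = this.nextSeq!;
    const packet = this.packets.get(seq);
    if (packet) {
      this.packets.delete(seq);
      playFrame(packet.payload); // hand one frame to the audio device
    } else {
      playSilence(this.frameMs); // gave up waiting: concealment, aka the glitch you heard
    }
    this.nextSeq = seq + 1;
    setTimeout(() => this.pop(), this.frameMs); // one frame every 20ms
  }
}

// Stand-ins for the actual audio output path.
declare function playFrame(payload: Uint8Array): void;
declare function playSilence(ms: number): void;
```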
Okay but let’s talk about the technical meat of the OpenAI article. We’re no longer on a boat, but let’s talk about ports.
When you host a TCP server, you open a port (ex. 443 for HTTPS) and listen for incoming connections. The TCP client will randomly select an ephemeral port to use, and the connection is identified by the source/destination IP/ports. For example, a connection might be identified as 123.45.67.89:54321 -> 192.168.1.2:443.
But there’s a minor problem… client addresses can change. When your phone switches from WiFi to cellular, oops your IP changes. NATs can also arbitrarily change your source IP/port because of course they can.
Whenever this happens, bye bye connection, it’s time to dial a new one. And that means an expensive TCP + TLS handshake which takes at least 2-3 RTTs. The users definitely notice the network hiccup when you’re live streaming.
WebRTC tried to solve this issue but made things worse. Seriously.
A WebRTC implementation is supposed to allocate an ephemeral port for each connection. That way, a WebRTC session can be identified by the destination IP/port only; the source is irrelevant. If the source IP/port changes, oh hey that’s still Bob because the destination port is the same.
But as OpenAI corroborates, this causes issues at scale because…
You could probably abuse IPv6 to work around this, but IDK I never tried. Twitch didn’t even support IPv6…
So most services end up ignoring the WebRTC specifications. Because of course they do. We mux multiple connections onto a single port instead.
At Twitch I literally hosted my WebRTC server on UDP:443. That’s supposed to be the HTTPS/QUIC port, but lying meant we could get past more firewalls. Like the Amazon corporate network, which blocked all but ~30 ports.
Discord uses ports 50000-50032, one for each CPU core. As a result it gets blocked on more corporate networks. But like, if you’re on a Discord voice call on the Amazon corporate network, you probably won’t be there much longer anyway.
HOWEVER, HUGE PROBLEM.
WebRTC is actually a bunch of standards in a trenchcoat, and 5 of those go over UDP directly. It’s not hard to figure out which protocol a packet is using, but we need to figure out how to route each packet.
- STUN packets contain a ufrag, so we can parse that and route on it.
- RTP/RTCP packets contain an ssrc (u32)… which we can usually route based on.

So OpenAI only uses STUN:
No protocol termination: Relay parses only STUN headers/ufrag; it uses cached state for subsequent DTLS, RTP, and RTCP, keeping packets opaque.
It’s a positive way of saying:
We really hope the user’s source IP/port never changes, because we broke that functionality.
While it’s impressive to load balance anything at OpenAI scale, their custom load balancing is a hack. But a necessary hack, because the core protocol is at fault.
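To make the routing problem concrete, here's a rough sketch of the per-packet demuxing a single-port server ends up doing, using the first-byte ranges from RFC 7983 (illustrative only; a relay like OpenAI's parses STUN and caches state for everything else):

```ts
type Classified =
  | { kind: "stun"; username: string } // carries the ufrag pair
  | { kind: "rtp"; ssrc: number }
  | { kind: "dtls" }
  | { kind: "unknown" };

function classify(packet: Uint8Array): Classified {
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  const first = packet[0];

  if (first <= 3) {
    // STUN: 20-byte header, then attributes of (type, length, 4-byte-padded value).
    let offset = 20;
    while (offset + 4 <= packet.length) {
      const type = view.getUint16(offset);
      const length = view.getUint16(offset + 2);
      if (type === 0x0006) {
        // USERNAME, e.g. "<their ufrag>:<our ufrag>"; route on it.
        const raw = packet.slice(offset + 4, offset + 4 + length);
        return { kind: "stun", username: new TextDecoder().decode(raw) };
      }
      offset += 4 + Math.ceil(length / 4) * 4;
    }
    return { kind: "unknown" };
  }

  if (first >= 20 && first <= 63) return { kind: "dtls" }; // nothing routable here

  if (first >= 128 && first <= 191) {
    // RTP: the SSRC sits at bytes 8-11. (RTCP shares this range but keeps its
    // sender SSRC at bytes 4-7; a real demuxer checks the payload-type byte first.)
    return { kind: "rtp", ssrc: view.getUint32(8) };
  }

  return { kind: "unknown" };
}
```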

Personally, I would prefer 3 raccoons.
Fun fact: Browsers can randomly generate the same ssrc. If there is a collision, and no source IP/port mapping is available, Discord attempts to decrypt the packet with each possible decryption key. If the key worked, hey we identified the connection!
The OpenAI blog post starts with 3 requirements, one of them is:
- Fast connection setup so a user can start speaking as soon as a session begins
lol
It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection. While we try to run CDN edge nodes close enough to every user to minimize RTT, it adds up.
Signaling server (ex. WHIP):
Media server:
* It’s complicated to compute, because some protocols can be pipelined to avoid 0.5 RTT. Kinda like half an A-Press.

an obscure reference to an obscure reference
All of this nonsense is because WebRTC needs to support P2P. It doesn’t matter if you have a server with a static IP address, you still need to do this dance.
It’s extra depressing when the signaling and media server are running on the same host/process. You end up doing two redundant and expensive handshakes. It’s like walking AND driving your car to the car wash.
Fun Fact: This was originally going to be a Fun Fact, but it gets its own section now.
WebRTC practically encourages you to fork the protocol. There’s so many limitations that I’ve barely scratched the surface. The browser implementation is owned by Google and tailor made for Google Meet, so it’s also an existential threat for conferencing apps.
Sad Fact: That’s why every conferencing app (except Google Meet) tries to shove a native app down your throat. It’s the only way to avoid using WebRTC.
OpenAI definitely has the debt funding to do this. But I think they should also throw the baby out with the bath water. Don’t fork WebRTC, replace it with something that has browser support.
Fun Fact: Discord has forked WebRTC so hard that native clients only implement a tiny fraction of the protocol. No more SDP/ICE/STUN/TURN/DTLS/SCTP/SRTP/etc. But we still have to implement everything for web clients.
If not WebRTC, then what should you use for Voice AI?
Honestly, if I was working at OpenAI, I’d start by streaming audio over WebSockets. You can leverage existing TCP/HTTP infrastructure instead of inventing a custom WebRTC load balancer. It makes for a boring blog post, but it’s simple, works with Kubernetes, and SCALES.
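As a sketch of how boring this can be (the wss://example.com/voice URL and the 100ms chunk size are made up; a real service's WebSocket framing will differ):

```ts
// Capture the mic, chop it into small Opus chunks, and push them over a plain
// WebSocket. TCP handles retransmission and ordering, so every chunk of the
// prompt arrives intact, possibly a little late.
async function streamMicOverWebSocket(): Promise<void> {
  const ws = new WebSocket("wss://example.com/voice");
  ws.binaryType = "arraybuffer";

  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(mic, { mimeType: "audio/webm;codecs=opus" });

  recorder.ondataavailable = (event) => {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(event.data); // one ~100ms chunk per message
    }
  };

  ws.onopen = () => recorder.start(100); // emit a chunk every 100ms

  // Downstream audio arrives as binary frames; hand them to a local playback
  // buffer (see the buffering sketch earlier).
  ws.onmessage = (event) => onAudioChunk(event.data as ArrayBuffer);
}

// Stand-in for the playback side.
declare function onAudioChunk(encoded: ArrayBuffer): Promise<void>;
```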
I think head-of-line blocking is a desirable user experience, not a liability. But the fated day will come and dropping/prioritizing some packets will be necessary. Then I think OpenAI should copy MoQ and utilize WebTransport, because…
QUIC FIXES THIS
Remember the round trip discussion? Good times. Here’s how many RTTs it takes to establish a QUIC connection: one (or zero, with 0-RTT session resumption).
But that was an easy one. Let’s dive into the deeper details of QUIC that you wouldn’t know about unless you’re a turbo QUIC nerd (it me).
Remember that link to RFC9146? In the DTLS section? That you didn’t click? Good times. The idea is literally copied from QUIC.
QUIC ditches source IP/port based routing. Instead, every packet contains a CONNECTION_ID, which can be 0-20 bytes long. And most importantly for us: it’s chosen by the receiver.
So our QUIC server generates a unique CONNECTION_ID for each connection. Now we can use a single port and still figure out when the source IP/port changes. When it does, QUIC automatically switches to the new address instead of severing the connection like TCP.
But if your gut reaction is: “how dare they! this is a waste of bytes!”, hold that thought. These bytes are very important, keep reading u nerd.
I glossed over this, but OpenAI’s load balancers (like most) depend on shared state. Even if you have a sticky packet router, load balancers can still restart/crash. Something has to store the mapping from source IP/port -> backend server.
They’re using a Redis instance to store the mapping of source IP/port to backend server. Simple and easy, I approve.
But do you know what is even simpler and easier? Not having a database. Here’s how QUIC-LB does it:
When a client initiates a QUIC connection, the load balancer forwards the packet to a healthy backend server. The backend server completes the handshake and encodes its own ID into the CONNECTION_ID. That way every subsequent QUIC packet contains the ID of the backend server.
Now packets become trivial for load balancers to forward. They don’t need encryption keys or a routing table, just decode the first few bytes and forward it to that guy. It doesn’t even matter if the server reboots.
Zero state also means zero global state. These load balancers could listen on a global anycast address and forward packets globally to the indicated backend server. Cloudflare uses this extensively; no need for a global Redis cluster.
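Here's a rough sketch of that stateless forwarding, assuming a toy scheme where every backend embeds its 1-byte server ID as the first byte of the connection IDs it issues (the real QUIC-LB draft supports fancier, even encrypted, encodings):

```ts
// A stateless UDP forwarder: read the destination connection ID straight out
// of each QUIC packet and relay it to the backend whose ID is embedded in it.
// Addresses, ports, and the 1-byte ID scheme are all illustrative.
import dgram from "node:dgram";

const backends = new Map<number, { host: string; port: number }>([
  [1, { host: "10.0.0.1", port: 4433 }],
  [2, { host: "10.0.0.2", port: 4433 }],
]);

const lb = dgram.createSocket("udp4");

lb.on("message", (packet, client) => {
  let target: { host: string; port: number } | undefined;

  if (packet[0] & 0x80) {
    // Long header (handshake): no server ID exists yet, pick any healthy backend.
    target = backends.get(1);
  } else {
    // Short header: the destination connection ID starts at byte 1, and its
    // first byte is the server ID the backend embedded during the handshake.
    target = backends.get(packet[1]);
  }

  if (target) {
    lb.send(packet, target.port, target.host);
  }
  // A real deployment also has to relay the backend's replies back to `client`
  // (or use direct server return); omitted here.
});

lb.bind(443);
```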
Unpaid Shill: AWS NLB offers QUIC load balancing using QUIC-LB. Other cloud providers need to step up their game and offer it too.
Based on the OpenAI blog, it sounds like they assign connections to regional load balancers. Functional but lame. Anycast is way cooler.
I brought this up in my ancient Quic Powers blog post, but I’ll excuse you for not reading it (yet). QUIC has something called preferred_address that is a game changer for load balancing.
Let’s say we have thousands of backend servers around the world that could accept a new connection. We have them all advertise the same anycast address, ex. 1.2.3.4. When a client tries to connect to 1.2.3.4, the magic internet routers forward the packet to one of the servers.
Now, we could just use QUIC-LB and route traffic to the indicated backend. But that would be boring.
Instead, we can give each QUIC server a unique unicast address, ex. 5.6.7.8. The idea is that we use anycast for handshakes and unicast for stateful connections.
- Each server listens on both 1.2.3.4 and 5.6.7.8.
- The client performs its handshake against the anycast address 1.2.3.4.
- During the handshake, the server advertises preferred_address=5.6.7.8.
- The client migrates the connection to the unicast address 5.6.7.8.

When the server is overloaded and doesn’t want more connections, it stops advertising 1.2.3.4. We won’t drop existing connections because they’re safe on unicast.
Just like that, no load balancers needed! The anycast address is basically a health check!
Holy shit I wish I actually had the scale to build this. Reach out if you work for the orange butthole company.

Looks something like this but orange.
I have labeled QUIC as the chad, therefore it is the superior protocol.
I know many engineers at OpenAI and they are extremely bright. They’re dealing with unprecedented levels of stress. They MUST scale and they MUST scale now.
I’m just some guy who quit my job to work on a passion project. I literally spend my time tracing memes. It’s easy for me to judge from my lofty position, like a movie critic ranting about how they cast Jared Leto again?
I just don’t think the obvious solution is a good fit for Voice AI. And the obvious solution is very difficult to scale. WebRTC is Jared Leto. There I said it.
And I’ll be honest, MoQ isn’t a perfect fit for Voice AI either. It’ll work, but a lot of the cache/fanout semantics are useless for 1:1 audio. You should definitely use QUIC though.
Anyway, hit me up if you want to chat: meself@kixel.me
I’m cool. You won’t regret it. Probably.
Written by @kixelated.