How OpenAI delivers low-latency voice AI at scale

Very grateful that OpenAI published the article/publicized their usage of Pion[0] a library I work on. If you aren't familiar with WebRTC it's a super fun space. I work on a book WebRTC for the Curious [1] that details how it works.

[0] https://github.com/pion/webrtc

[1] https://webrtcforthecurious.com

The low latency is more of a pain point than a good thing, the way they have it implemented. Trying to have a casual conversation with it, as humans we naturally pause, and GPT will take this as you are "done" and start blabbing away.

I also suffer from finding the appropriate word I want as I've gotten older and slower, and this fast-voice-gpt just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.

Wait a minute... I’m genuinely happy that they are sharing this, but keep in mind that realtime audio model from OpenAI are still stuck with the 4o family in terms of capabilities, sadly. I still find them so useful, such a pity that there’s no real competitor in this segment, having the experience a real conversation has helped me so much in expressing ideas and concepts.

Still, it’s worth to keep in mind that these are not frontier models, differently from when they were released.

(Please Sam, if you read this, release the new realtime audio models)

> Voice AI only feels natural if conversation moves at the speed of speech […] At OpenAI’s scale, that translates into three concrete requirements: Global reach for more than 900 million weekly active users

Surely the number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?

That’s the kind of thing that influences business decisions like knowing how much hardware and software optimization to throw at a problem.

RFC 9297 support can't come quick enough in browsers. Would obviate having to deal with WebRTC in a client-server scenario.

if anyone is looking to get into this. pipecat is a great open-source repo and community. https://github.com/pipecat-ai/pipecat

I wouldn't mind waiting longer for answers that would go through a better model with more thinking. As long as it has good support for interrupting and also it doesn't start answering as soon as I pause for 1 second and it's smart about knowing I'm done speaking.

IMO this probably isn't just about latency. keeping people in voice gives them training data text never will. is that why they were fine going transceiver over sfu and mostly ignoring multi-party?

Am I reading this right that OpenAI is not using Livekit for WebRTC/audio anymore?

If a transceiver crashes during a stream, how is the active session recovered? Does the system automatically re-establish the context in a new WebRTC session?

I think it's better to join some kind of club if you want to make friends?

I hate the voice ai though, it's so much dumber

what i learned from making a webrtc+kubernetes game streaming product:

- openai is wrong. almost of the issues they described are issues with libwebrtc, not with webrtc, kubernetes, network architecture, etc. the clue was when they said "the conventional one-port-per-session WebRTC model."

- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, stunner, are too immature.

- libwebrtc is the only game in town.

- they haven't discovered libwebrtc feature flags or how it works with candidates, which directly fix a bunch of latency issues they are discovering. a correct feature flag can instantly reduce latency for free, compared to pay for twilio network traversal style solutions

- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)

this is the first time i'm experiencing gell mann amnesia with openai! look those guys are brilliant, but there is hardly anyone in the world who is doing this stuff correctly.

Should I or shouldn't I be glad to see zero mention on Codex.

so is the answer

WebRTC + Kubernetes

OpenAI uses Go for the networking implementation for the relays and the services, which makes a ton of sense, instead of something immature as TypeScript / Node or whatever.

Yet another reason to not consider anything else like that for low-latency networking. Golang (or even Rust and C++) is unmatched for this use-case.

Who cares? Their company is dying.

It's missing the part where they explain how they obtained the training data for their voice AI.

It's bad enough having to speed-read the waffle of its written answers; even when told to be concise, the thought of having to listen to it waffle on in its smarmy, sycohpantic fashion makes me want to reach for the sick bag.

> Global reach for more than 900 million weekly active users

lol, definitely didn't need to know there's 900M weekly users for this post. I mean yeah, there's a lot of users and they serve globally, that's relevant. But this is just pulling out your biggest stat because you can. How many voice users you have would actually be relevant and interesting but, to baselessly speculate on motivation here, might be a number that doesn't add as much fuel to an upcoming IPO as reminded people that you're almost at a billion users does.

RFC 9297 support can't come quick enough in browsers. Would obviate having to deal with WebRTC in a client-server scenario.

If a transceiver crashes during a stream, how is the active session recovered? Does the system automatically re-establish the context in a new WebRTC session?

It doesn't today, but you could with sometime like this [0]. You can save/suspend all WebRTC state and bring it back with the next process.

[0] https://github.com/pion/webrtc-zero-downtime-restart

Surely the number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?

That’s the kind of thing that influences business decisions like knowing how much hardware and software optimization to throw at a problem.

Yeah, that's why they've used "reach" - the total number of users who could be exposed to the feature regardless of engagement.

I think it's better to join some kind of club if you want to make friends?

so is the answer

WebRTC + Kubernetes

OpenAI uses Go for the networking implementation for the relays and the services, which makes a ton of sense, instead of something immature as TypeScript / Node or whatever.

Yet another reason to not consider anything else like that for low-latency networking. Golang (or even Rust and C++) is unmatched for this use-case.

"something immature as TypeScript / Node or whatever"

Node.js's initial release was May 27, 2009

Golang 's initial release was November 10, 2009

They're different, yes, but it's not like

Can golang do zero copy networking nowadays? In the past golang was terrible at this kind of thing due to allocations and copies of all relayed data.

Am I reading this right that OpenAI is not using Livekit for WebRTC/audio anymore?

It does appear that way. The LiveKit server is not what you would want for this architecture anyway (as they basically say with the SFU discussion), although it does have a lot of useful stuff in the client SDKs.

Still, it’s worth to keep in mind that these are not frontier models, differently from when they were released.

(Please Sam, if you read this, release the new realtime audio models)

what i learned from making a webrtc+kubernetes game streaming product:

- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, stunner, are too immature.

- libwebrtc is the only game in town.

- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)

this is the first time i'm experiencing gell mann amnesia with openai! look those guys are brilliant, but there is hardly anyone in the world who is doing this stuff correctly.

if anyone is looking to get into this. pipecat is a great open-source repo and community. https://github.com/pipecat-ai/pipecat

Who cares? Their company is dying.

It's missing the part where they explain how they obtained the training data for their voice AI.

IMO this probably isn't just about latency. keeping people in voice gives them training data text never will. is that why they were fine going transceiver over sfu and mostly ignoring multi-party?

> Global reach for more than 900 million weekly active users

It doesn't today, but you could with sometime like this [0]. You can save/suspend all WebRTC state and bring it back with the next process.

[0] https://github.com/pion/webrtc-zero-downtime-restart

Claude voice mode has come a long way! I'd say it's smarter than CGPT AVM last time i tried it.

But personally I've settled on just speaking to the slower models over a custom tts app, I find it being instant was not actually that important, and in the silence I find myself marinating in the discussion more anyway

Grok voice is surprisingly good, actually. It's still a dumber model than the thinking modes of frontier models, but it's less dumb than the voice modes of other providers.

Yes the voice part of OpenAI realtime/voice mode is great but it’s pretty dumb compared to newer models and often gets stuck repeating itself.

Google’s Gemini flash live 3.1 is better, especially used via the API - it can do tool calling (including to other, even smarter LLMs if you set it up yourself), you can set the reasoning level (even high is still close enough to realtime) and it can ground answers in google search. I love bidirectional voice and right now it’s probably the best option. You can try it in AI studio

Yeah I was quite surprised that the advanced chat gpt voice mode can’t itself go and message the frontier model underneath to retrieve data and then speak it. I basically tried asking it for that (something like “can you go and ask gpt5.5 to research this more in depth, and while we wait, tell me about XYZ”), but apparently that’s not a thing.

You can feel what is possible using Gemini speech to speech model, it can do tool calls and is very fast. It lacks somewhat in thinking capability but you can setup a tool call to a smarter model and it acts as a relay. I’ve been very impressed.

This is what makes their voice mode unusable to me. I can’t stand the way 4o replies and it’s such a big jump in quality from text mode

Yeah, the question in the title can be answered: "by using gpt-4o, a model 2 years behind the frontier, to serve audio responses"

Did you use libwebrtc on the backend? When you say `libwebrtc` is the only game in town are you talking about clients or servers?

Even for clients you have things like libpeer that libwebrtc can't hit.

When you have hard problems with unclear optimal solutions, taking this approach of a public show & tell will often (always?) solicit lots of interesting ideas the team may have not yet considered :)

Something I noticed is that companies that are vibe-coding their products miss out on the intelligence that (still) only humans can bring to bear. Just the knowledge cutoff alone puts AI at a serious disadvantage in any rapidly changing field.

If you like Pipecat’s focus on speed, you might also try out our open source, which comes with all the batteries included (knowledge base, telephony/SIP, variables, BYOK any LLM STT TTS, Speech to Speech, etc )

And it's fully OSS- like n8n for voice AI, and you can use it with OpenClaw or Claude code - recently launched MCPs.Github- https://github.com/dograh-hq/dograh, Youtube -https://www.youtube.com/watch?v=sxiSp4JXqws&list=PLDqzGuN7B1...

I wish I had known about Pipecat a lot sooner. I found out about it a few weeks back, and since Gemma 4 launched, I've been building my own entirely local voice assistant using Gemma 4 + Kokoro TTS + Whisper from scratch - https://github.com/pncnmnp/strawberry.

Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3

I've been looking at this! Great project.

Should I or shouldn't I be glad to see zero mention on Codex.

Shouldn’t, I think - advanced voice is a surprisingly slick feature, and if you’re someone who feels that they can think and speak more naturally than when they think and type, AI voice transcription is kind of huge.

[0] https://github.com/pion/webrtc

[1] https://webrtcforthecurious.com

I use pion thanks for making it!

Curious if you thought their approach was necessary, it seemed like a ton of complexity to reduce one of the faster parts of a voice AI setup. Having a fast model and accurate VAD seems way more important than fine tuning WebRTC transit times.

Thanks for WebRTC for the Curious and for Pion! Not using the latter directly, but have used both to better understand WebRTC

Appreciate you putting the entire book online!

I read parts of it a while ago when I had an idea on using webRTC data channels to pass data from databases to browser clients via a CLI. Your book made me understand that it's probably not a great fit for my use case. I just used a centralized control plane and websockets instead.

I still feel like there is something fun that we can do with webRTC data channels + zero copy Apache Arrow arraybuffers + duckdb WASM, but haven't figured it out yet

WebRTC is great and so is Pion, thanks for help making and maintaining it! I loved learning about WebRTC from WebRTC for the Curious!!!

slightly unrelated but what’s with storing the entire codebase in the root directory instead of a nested src folder? It makes getting to the README a lot more difficult

I used pion and it was fantastic. Most of the article seems pretty standard webrtc techniques for performant voice.

Only a software dev would start their referencing at 0 lol

I hate the voice ai though, it's so much dumber

I used to use it all the time until about a year ago or so. Its responses are full of filler and the safeguards are really overbearing. It often will just give wrong answers in a way that GPT-5.x does not. I once asked it why a particular celebrity was canceled and it refused to tell me because it may harm me to know what they said!

Fwiw - I found the advanced AI voice feature to be actually detrimental. It's good if you just want a single sentence answer. I've turned it off though when I want a more detailed, structured, considered answer.

Yeah, that's why they've used "reach" - the total number of users who could be exposed to the feature regardless of engagement.

I think these are 2 different layers of "latency". The latency in the article is referring to the transport of the audio stream itself while the latency in your scenario is about how quickly to start responding inside the audio stream.

I’ve also experienced this and it’s really annoying. There is this pressure to keep talking if I’m not done with my thought that feels pretty unnatural at least for me. If I’m searching for the right word, I want the opportunity to find it.

I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.

This has more to do with Voice Activity Detection (VAD) than the latency described in the article

There's a really interesting project in Japanese natural language processing called J-Moshi that had a novel approach and in my opinion good results.

They tried to make it mimic the way Japanese is full of really quick acknowledgement sounds and it seems to allow it to handle those pauses and interruptions really well.

https://en.nagoya-u.ac.jp/news/articles/say-hello-to-j-moshi... (english)

https://nu-dialogue.github.io/j-moshi/ (japanese and english)

I must admit it's a bit weird when LLMs laugh, I don't really know how I feel about that but it seems to laugh at the right times. Very tangential, but cockatoos have been known to mimic the right time to laugh presumably based on tonal cues that a joke was just made (I have experienced this first hand with rescue birds who li e amongst humans)

Reducing the network latency helps with this exactly. OpenAI can make better timed decisions when to begin responding so it'll feel less like an interruption. I've also seen some research on full duplex voice models that handle interruption more like an organic conversation and low latency will help there as well

In voice conversations I tell it not to reply at all or only say “Understood” until I use some kind of code word. Not perfect, but less intrusive.

People are migrating to the "End Of Thought" triggers. Deepgram does that wonderfully.

This is more of a VAD/turn detection issue. It's gotten a lot better over the last few years, but it's a hard problem. The extra ~100ms of latency makes a huge difference otherwise, especially when you have use cases that require tool calling that can easily add 500ms+ of latency.

Hard problem. I find myself adding in filler to stop the thing from jabbering.

I also think it spends most of its iq on sounding good rather than thinking about the problem. “Yeah absolutely I can see why you’d like to…” etc. This is likely because it’s on a timer and maybe voice is more expensive to process? Text responses spend more time on the task.

Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.

I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).

Grok solves this by having an optional push-to-talk mode, but this is not hands-free and thus more cumbersome than just having a user-configurable variable like seconds_delay_before_sending_voice_input.

To defend them a little: voice is a little rough around the edges now, so there’s a chicken and egg problem of whether to prioritize improving voice if usage isn’t high partially because it’s clunky.

"something immature as TypeScript / Node or whatever"

Node.js's initial release was May 27, 2009

Golang 's initial release was November 10, 2009

They're different, yes, but it's not like

okay, sure, but one is by microsoft, the other by a 25 year old, and another by rob pike. The one by rob pike is going to be infinitely more mature and thought out than a hacky type system on JS because it isn't his first rodeo

Can golang do zero copy networking nowadays? In the past golang was terrible at this kind of thing due to allocations and copies of all relayed data.

They do link to the Livekit docs in the footnotes: https://docs.livekit.io/transport/self-hosting/kubernetes/

whats wrong with livekit ?

Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.

I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).

Claude voice mode has come a long way! I'd say it's smarter than CGPT AVM last time i tried it.

This is what makes their voice mode unusable to me. I can’t stand the way 4o replies and it’s such a big jump in quality from text mode

Yeah, the question in the title can be answered: "by using gpt-4o, a model 2 years behind the frontier, to serve audio responses"

When you have hard problems with unclear optimal solutions, taking this approach of a public show & tell will often (always?) solicit lots of interesting ideas the team may have not yet considered :)

I've been looking at this! Great project.

They do link to the Livekit docs in the footnotes: https://docs.livekit.io/transport/self-hosting/kubernetes/

whats wrong with livekit ?

There's a really interesting project in Japanese natural language processing called J-Moshi that had a novel approach and in my opinion good results.

They tried to make it mimic the way Japanese is full of really quick acknowledgement sounds and it seems to allow it to handle those pauses and interruptions really well.

https://en.nagoya-u.ac.jp/news/articles/say-hello-to-j-moshi... (english)

https://nu-dialogue.github.io/j-moshi/ (japanese and english)

People are migrating to the "End Of Thought" triggers. Deepgram does that wonderfully.

Thanks for WebRTC for the Curious and for Pion! Not using the latter directly, but have used both to better understand WebRTC

WebRTC is great and so is Pion, thanks for help making and maintaining it! I loved learning about WebRTC from WebRTC for the Curious!!!

I used pion and it was fantastic. Most of the article seems pretty standard webrtc techniques for performant voice.

id rather use the thinking models so the voice mode isnt' useful, i do use voice-to-text more and more just to speed things up though

Grok voice is surprisingly good, actually. It's still a dumber model than the thinking modes of frontier models, but it's less dumb than the voice modes of other providers.

Grok voice model is also a thinking model. I agree that it’s far better than the other voice models

Just give me a option to have a slower response but better model…

Yes the voice part of OpenAI realtime/voice mode is great but it’s pretty dumb compared to newer models and often gets stuck repeating itself.

Thanks, I’ll try it, even if my experience wasn’t that great with Google models lately (503s)

Did you use libwebrtc on the backend? When you say `libwebrtc` is the only game in town are you talking about clients or servers?

Even for clients you have things like libpeer that libwebrtc can't hit.

Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3

This has more to do with Voice Activity Detection (VAD) than the latency described in the article

I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.

In voice conversations I tell it not to reply at all or only say “Understood” until I use some kind of code word. Not perfect, but less intrusive.

Hard problem. I find myself adding in filler to stop the thing from jabbering.

I use pion thanks for making it!

slightly unrelated but what’s with storing the entire codebase in the root directory instead of a nested src folder? It makes getting to the README a lot more difficult

Appreciate you putting the entire book online!

I still feel like there is something fun that we can do with webRTC data channels + zero copy Apache Arrow arraybuffers + duckdb WASM, but haven't figured it out yet

Only a software dev would start their referencing at 0 lol

yes - i used libwebrtc on the backend and, pre-LLM, patched it to work around a lot of the things i discovered that were directly related to low latency AV streaming. pion didn't exist then.

i think the challenge is that pion is an excellent product today. it would benefit me if its innovations were subsumed into libwebrtc, because eventually those innovations will show up in the iOS stack, which is one of the customers that matter to me. it is subjective if it is the MOST important customer, that is my belief and it is probably true of openai, at least until they get their own device out the door.

there can be many, many use cases though! not everything has to be, try to make the thing for 1b people that has to interact with all the most powerful and meanest businesses on the planet.

GPT 5.5's knowledge cutoff is August 2025. Which aspect of WebRTC has meaningfully changed since then?

Yeah Gemma4 was and is great fun to do this with - I too am building pretty much the same as yourself in Go.

https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name

What do you have going on the hardware side? I want to plug this into hass but don’t know what hardware I need for reasonable latency

They are orthogonal.

Suppose you have 100ms audio latency and no wait time. Then, natural pause will trigger response immediately but you won't notice it has started until after ~200ms (round-trip time). Twice as annoying.

I think he’s saying they are doing an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative

That seems to be the issue: VAD is insufficient here.

Knowing when to respond requires semantic understanding, which probably only the model itself is capable enough.

Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?

Exactly. It's a tangent, but clearly a pain point for enough users.

100%. I have to hold the floor by filling the space with "ummmmmmmm.... uhhhh...." which inevitably distracts me from my point altogether. Poor user experience.

Have you tried telling it to pause to let you think?

I often use it while I’m walking and tell it to not respond until I initiate a conversation.

I find this is a problem even with human conversations. Some people just aren’t very good at telegraphing when they’ve finished ‘their turn’ talking. Or worse yet, aren’t willing to take turns in the first place.

It seems that tool calling shouldn't be 500ms of latency?

100% .. as a product designer/developer, i use it heavily for early feature ideation .. i’ll do a loose, exploratory back and forth on a long walk .. then pass the transcript to claude to validate and turn into a spec ..

Their voice capable model is several generations behind the state of the art text-only one, as far as I know.

I don’t think it even has reasoning tokens, so it’s no surprise that it’s as most as smart as the “instant” models (i.e., not very).

Fwiw you can prompt it to respond differently to you.

Thanks for using it :)

I think It’s a case of you improve what you own. The owners of WebRTC servers were aggressively improving their part. They don’t own the inference servers.

Thats the default for go projects. Go imports are repository strings (e.g.):

     import ("github.com/go-sql-driver/mysql")

so it's standard to have the library files in the root directory.

This is valid criticism. Go fanbois don't like listening to any go criticism. They were all like who needs templates in go. and now go has templates.

To me go code looks like somebody vomitted stuff in the root dir and i have to wade through that every time. No namespacing. nothing

Thanks for reading it!

You can't beat Websockets :) Especially since you have so much tooling/existing stuff that works with HTTP.

I have been trying to get a website off the ground that does Datachannels + SQlite in the browser and then users sync between each other. I have gotten distracted so many times though.

What is preventing the fun is that even though we now have IPv6 widely enough available we still can't have p2p connections in the browser without a cumbersome control plane of servers. If you could join a federation in the browser from some bootstrap IPs then I think we could have some real distributed fun.

I do this too I never made the connection.

Interestingly, that kind of parallels the real world too: if you want a quick and high level answer, talk to someone in person; if you want something detailed and info-dense, get them to write it down.

Grok voice model is also a thinking model. I agree that it’s far better than the other voice models

Just give me a option to have a slower response but better model…

id rather use the thinking models so the voice mode isnt' useful, i do use voice-to-text more and more just to speed things up though

yes - i used libwebrtc on the backend and, pre-LLM, patched it to work around a lot of the things i discovered that were directly related to low latency AV streaming. pion didn't exist then.

there can be many, many use cases though! not everything has to be, try to make the thing for 1b people that has to interact with all the most powerful and meanest businesses on the planet.

They are orthogonal.

Exactly. It's a tangent, but clearly a pain point for enough users.

Their voice capable model is several generations behind the state of the art text-only one, as far as I know.

I don’t think it even has reasoning tokens, so it’s no surprise that it’s as most as smart as the “instant” models (i.e., not very).

Thanks for using it :)

I think It’s a case of you improve what you own. The owners of WebRTC servers were aggressively improving their part. They don’t own the inference servers.

Fwiw you can prompt it to respond differently to you.

Thats the default for go projects. Go imports are repository strings (e.g.):

     import ("github.com/go-sql-driver/mysql")

so it's standard to have the library files in the root directory.

I do this too I never made the connection.

Thanks, I’ll try it, even if my experience wasn’t that great with Google models lately (503s)

Give it a shot, 3.1 live one in AI studio/API and max out reasoning - not the one in Gemini app it’s an older model.

Another option is to use pipecat with their VAD and separate STT and TTS and any (fast) LLM of your choice - but it’s more plumbing and not a true speech to speech model

Yeah Gemma4 was and is great fun to do this with - I too am building pretty much the same as yourself in Go.

https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name

Looks like everyone is building one of these, I have my own little version that's using streaming STT, it can actually be too fast in some cases, and I have a little ring buffer grabbing audio from before the wake word detection fires (so it can hear "Hey Jarvis, turn on the lights" without deliberate pause) https://github.com/jaggederest/pronghorn/

GPT 5.5's knowledge cutoff is August 2025. Which aspect of WebRTC has meaningfully changed since then?

Dec 2025, actually: https://developers.openai.com/api/docs/models/gpt-5.5

(though knowledge cutoffs in practice can be bit fuzzy)

There's a difference between some piece of information being "officially published" and the AIs gaining a sufficient understanding of it.

Take any popular technology problem that has been around for a few years such as... wrangling Kubernetes with YAML config files. There's probably hundreds of thousands of discussions, source code samples from GitHub, official docs, blogs, bug reports, pull requests, etc... all discussing the nuances, pitfalls, pros/cons, etc. During pre-training the AIs internalise this and can utilise it later.

Now compare this with anything recent and (relatively) obscure, such as new .NET 10 features which were first officially publishing in November 2025, a month before GPT 5.5 cutoff.

As a human developer, these new language capabilities are on the same "level" for me in my day-to-day work as the features from .NET 9 or .NET 8. Similarly, my IDE has native refactoring and code cleanup support that can take C# code from the previous years and bring it up to the idiomatic style of $currentyear.

The AIs just can't do this, because one single Microsoft release note and one learn.microsoft.com page is nowhere near enough training data! The AI hasn't seen millions of lines of code written with .NET 10, taking advantage of .NET 10 improvements, and hasn't seen thousands of discussions about it. Not yet.

This is a fundamental issue with how LLMs are (currently) trained! Simply moving the cutoff date is not enough.

Human learning is second-order. If I see even the tiniest bit of updated information that invalidates a huge pile of older information, my memory marks everything old as outdated and from that second onwards I use only the new approach.

AI learning is first-order. It has to be given the discussions/blogs/posts that say "Stop using the legacy way, it's terrible! Start using the new hotness"! That, it can learn, but it'll be perpetually behind the rest of us by at least a few years.

Not to mention that thanks to AI forums like StackOverflow are dying, so... where is it going to get this kind of training data from in the future!?

AI training needs to switch to "second order", but AFAIK this is an unsolved problem at this time.

The problem is the sheer amount of knowledge out there. Particularly when using niche technologies (which webrtc and web audio still is, when measured by how many people develop using it), it is not surprising that AI doesn't have everything available in its responses, unless you specifically ask it about something you already know it should know.

What do you have going on the hardware side? I want to plug this into hass but don’t know what hardware I need for reasonable latency

The whole setup works on my M2 MacBook Pro with 16 GB RAM. I use Gemma 4B via LiteRT-LM.

I've found that LiteRT-LM has a much lower DRAM footprint than Ollama. I've also made tons of optimizations in the code - for eg, you can do quite a bit with a 16k context window for a voice assistant while managing a good footprint, so I keep track of the token usage and then perform an auto-compaction after a while. I use sub-agents and only do deep-think calls with them, so the context window is separated out. In a multi-turn conversation, if Gemma 4 directly processes audio input, the KV cache fills up within a few turns, so I channel it all via Whisper.

Also, by far the biggest optimization is: 3-stage producer-consumer architecture. The LiteRT-LM streams tokens and I split them into sentences. A synthesizer thread then converts each sentence to audio via Kokoro TTS - the main thread then plays audio chunks sequentially. There's a parallel barge-in monitor thread. https://github.com/pncnmnp/strawberry/blob/main/main.py#L446

I did not want to use openWakeWord or Picovoice because they had limitations on which wake word you could choose. Alternative was to train a model of my own. So I created my own wake word detection pipeline using Whisper Tiny - works surprisingly well: https://github.com/pncnmnp/strawberry/blob/main/main.py#L143...

Also, I have VAD going with smart turn v3 (like I mentioned above) + I use browser/websocket for AEC + Barge-in (https://github.com/pncnmnp/strawberry/blob/main/audio_ws.py).

I'm using the MacBook's built-in microphones for this, though, and I haven't fully tested it with other microphones. I've been ironing out the rough edges on a daily basis. I should write a quick blog on this too.

Check out [0]. You can do 'Voice AI' on small/cheap hardware. It's the most fun you can have in the space ATM :) It's been a while, but posted a demo here [1]

[0] https://github.com/pipecat-ai/pipecat-esp32

[1] https://www.youtube.com/watch?v=6f0sUEUuruw

That seems to be the issue: VAD is insufficient here.

Knowing when to respond requires semantic understanding, which probably only the model itself is capable enough.

Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?

I am excited for VAD to go away. PersonaPlex totally seems like the future.

However things like 'Call center helpline' turn based actually seems better! You don't want to be interrupted when giving information back and forth (I think?)

I think he’s saying they are doing an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative

When GP mentioned reducing conversational latency as a negative that made sense (and should probably be done IMO), it just wasn't the same category of latency the article talks about reducing. I.e. increasing "network latency" just makes the conversation feel more and more out of sync, it doesn't change the rate at which the AI will interrupt ("turn latency") because the latter is based on the duration of the pause in the audio stream, not the duration it took to deliver that audio stream.

If you meant there is a case where reducing the network latency at the same delivery reliability for a given audio stream is actually a negative then I'd love to hear more about it as I'm a network guy always in search of an excuse for latency :D.

By you want to be able to interject “hold on…” and have it immediately stop talking, when it goes off the rails.

And GP is correctly pointing out that the only negative here (silence waiting latency maybe being too low) is tunable separately from the network latency number.

100%. I have to hold the floor by filling the space with "ummmmmmmm.... uhhhh...." which inevitably distracts me from my point altogether. Poor user experience.

Seems like there's a big risk of having that habit leak into human conversation. A lot of people try really hard to train themselves not to add those fillers.

Have you tried telling it to pause to let you think?

I often use it while I’m walking and tell it to not respond until I initiate a conversation.

I’ve tried this and it says it will but just keeps cutting in. I hate this feature so much.

It seems that tool calling shouldn't be 500ms of latency?

If you have tool calling complex enough that it necessitates a higher reasoning level, and you would otherwise have reasoning set to "none", this can easily come out to 500ms.

This is valid criticism. Go fanbois don't like listening to any go criticism. They were all like who needs templates in go. and now go has templates.

To me go code looks like somebody vomitted stuff in the root dir and i have to wade through that every time. No namespacing. nothing

I don't like go as a personal preference but reducing them to "fanboys" is a bit reductive. I'm sure the same could be said about your own favorite language.

Ok... The question was why is it like that. The answer is because it's in go. Nobody was anything other than civil before you neckbearded in here. Chill. There's a sane way to say what you said.

Thanks for reading it!

You can't beat Websockets :) Especially since you have so much tooling/existing stuff that works with HTTP.

I have been trying to get a website off the ground that does Datachannels + SQlite in the browser and then users sync between each other. I have gotten distracted so many times though.

Super cool! Please give me a ping if you ever launch that (my email is on my website (in my profile))

Voice AI only feels natural if conversation moves at the speed of speech. When the network gets in the way, people hear it immediately as awkward pauses, clipped interruptions, or delayed barge-in. That matters for ChatGPT voice, for developers building with the Realtime API, for agents working in interactive workflows, and for models that need to process audio while a user is still talking.

At OpenAI’s scale, that translates into three concrete requirements:

Global reach for more than 900 million weekly active users
Fast connection setup so a user can start speaking as soon as a session begins
Low and stable media round-trip time, with low jitter and packet loss, so turn-taking feels crisp

The team at OpenAI responsible for real-time AI interactions recently rearchitected our WebRTC stack to address three constraints that started to collide at scale: one-port-per-session media termination does not fit OpenAI infrastructure well, stateful ICE (Interactive Connectivity Establishment) and DTLS (Datagram Transport Layer Security) sessions need stable ownership, and global routing has to keep first-hop latency low. In this post, we walk through the split relay plus transceiver architecture we built to preserve standard WebRTC behavior for clients while changing how packets are routed inside OpenAI’s infrastructure.

WebRTC is an open standard for sending low-latency audio, video, and data between browsers, mobile apps, and servers. It’s often associated with peer-to-peer calling, but it’s also a practical foundation for client-to-server real-time systems because it standardizes the hard parts of interactive media: ICE for connectivity establishment and NAT (Network Address Translation) traversal, DTLS and SRTP (Secure Real-time Transport Protocol) for encrypted transport, codec negotiation for compressing and decoding audio, RTCP (Real-time Transport Control Protocol) for quality control, and client-side features such as echo cancellation and jitter buffering.

That standardization matters for AI products. Without WebRTC, every client would need a different answer for how to establish connectivity across NATs, encrypt media, negotiate codecs (the coder-decoders selected for transmission and decompression) and adapt to changing network conditions. With WebRTC, we can build on a protocol stack that’s already implemented across browsers and mobile platforms, focusing our own work on the infrastructure that connects real-time media to models.

We also build on the WebRTC ecosystem itself, including mature open-source implementations and the standard work that keeps browsers, mobile apps, and servers interoperable. Foundational work by Justin Uberti (one of WebRTC’s original architects) and Sean DuBois (creator and maintainer of Pion) made it possible for teams like ours to build on battle-tested media infrastructure rather than reinvent low-level transport, encryption, and congestion-control behavior. We’re fortunate that both Justin and Sean are now colleagues here at OpenAI, helping guide how we bring WebRTC and real-time AI closer together.

For AI, the most important property is that audio arrives as a continuous stream. A spoken agent can begin transcribing, reasoning, calling tools, or generating speech while the user is still talking, instead of waiting for a full upload. That’s the difference between a system that feels conversational and one that feels like push-to-talk.

Once we chose WebRTC, the next question was where to terminate it (where we’d accept and own the WebRTC connection—for example, at the edge) and how to connect those sessions to the inference backend. Termination matters because it determines how we handle real-time session state, media transport, routing, latency, and failure isolation.

An SFU, or selective forwarding unit, is a media server that receives one WebRTC stream from each participant and selectively forwards streams to the others. In this model, the SFU terminates a separate WebRTC connection for every participant, and the AI joins as another participant in the session. That can be a good fit for products that are inherently multiparty, such as group calls, classrooms, or collaborative meetings. It keeps audio codecs, RTCP messages, data channels, recording, and per-stream policy in one place.1

Even in client-to-AI products, an SFU is often the default starting point because it lets teams reuse one proven system for signaling, media routing, recording, observability, and future extensions such as human handoff or adding more participants.

Our workload is different. Most sessions are 1:1—one user talking to one model, or one application talking to one real-time agent—with latency sensitivity on every turn. For that shape of traffic, we chose a transceiver model: a WebRTC edge service terminates the client connection and then converts media and events into simpler internal protocols for model inference, transcription, speech generation, tool use, and orchestration.

In this design, the transceiver is the only service that owns the WebRTC session state, including ICE connectivity checks, the DTLS handshake, SRTP encryption keys, and session lifecycle. “Termination” here means the transceiver is the endpoint that completes those handshakes and encrypts or decrypts the media. Keeping that state in one place made session ownership easier to reason about, and it let backend services scale like ordinary services instead of acting as WebRTC peers themselves.

After choosing the transceiver model, our first implementation was a single Go service built on Pion that handled both signaling and media termination. It powers ChatGPT voice, the Realtime API’s WebRTC endpoint, and a number of research projects.

Operationally, the transceiver service does two jobs:

Signaling: SDP negotiation, codec selection, ICE credentials, and session setup
Media: Terminating downstream WebRTC connections and maintaining upstream connections to backend services for inference and orchestration

We wanted the service to run like the rest of our infrastructure: on Kubernetes, where workloads can scale up and down, and move across hosts as demand changes. But the conventional one-port-per-session WebRTC model fits that environment poorly, because it depends on large public UDP port ranges that are difficult to expose, secure, and preserve as pods are added, removed, or rescheduled.2

The first problem was the one-port-per-session model itself. At high concurrency, that means exposing and managing very large UDP port ranges.

Cloud load balancers and Kubernetes services are not designed around tens of thousands of public UDP ports per service. Each additional range adds operational complexity in load balancer config, health checking, firewall policy, and rollout safety.3
Large UDP port ranges are hard to secure because they expand the externally reachable surface area and make network policy harder to audit.
They’re also a poor fit for autoscaling. Pods are constantly added, removed, and rescheduled in Kubernetes. Requiring each pod to reserve and advertise a large stable port range makes that elasticity brittle.4

This is why many WebRTC systems move toward a single UDP port per server, with application-level demultiplexing behind that port.5

Single-port-per-server designs solve port count, but they introduce a second problem: preserving ownership of each session across a fleet.

ICE and DTLS are stateful protocols. The process that created a session needs to keep receiving that session’s packets so it can validate connectivity checks, complete the DTLS handshake, decrypt SRTP, and process later session changes such as ICE restarts. If packets for the same session land on a different process, setup can fail or media can break.

That gave us a specific target: expose a small, fixed UDP surface to the public internet, while still routing every packet to the transceiver that owns the corresponding WebRTC session.

We evaluated several ways to get there, including TURN (Traversal Using Relays around NAT), where an edge relay terminates client allocations and forwards traffic on their behalf.2

Approach

Pros

Cons

Unique IP:port per session (also known as native direct UDP)

Direct client-to-server media path

No forwarding layer in the data path

Requires one public UDP port per session

Large port ranges are difficult to expose and secure

Poor fit for Kubernetes and cloud load balancers

Unique IP:port per server

Much smaller public UDP footprint than per-session exposure

One shared socket per server can demultiplex many sessions

Works cleanly on a single host, but not across a shared load-balanced fleet by itself

Session demultiplexing on a single host only helps after a packet reaches that host; across a load-balanced fleet, the first packet can still land on the wrong instance, so you still need a deterministic way to steer each session to the process that owns it

TURN relay (protocol-terminating)

Clients only need to reach the TURN relay address and port

Can centralize policy at the edge

TURN allocations add setup round trips

Moving or recovering allocations across TURN servers is still difficult

Stateless forwarder + stateful terminator (OpenAI’s relay + transceiver)

Small public UDP footprint

Transceiver still owns the full WebRTC session

Adds one forwarding hop before media reaches the owning transceiver

Requires custom coordination between relay and transceiver

The architecture we shipped splits packet routing from protocol termination. Signaling still reaches the transceiver for session setup, while media enters through the relay first. The relay is a lightweight UDP forwarding layer with a small public footprint, and the transceiver is the stateful WebRTC endpoint behind it.

The relay does not decrypt media, run ICE state machines, or participate in codec negotiation. It reads enough packet metadata to choose a destination, then forwards the packet to the transceiver that owns the session. The transceiver still sees a normal WebRTC flow and still owns all protocol state. From the client’s perspective, nothing about the WebRTC session changes.

First-packet routing is the key step in this setup. A relay has to route the first packet from a client before any session exists on the packet path itself rather than by pausing on an external lookup service.

Every WebRTC session already carries a protocol-native routing hook: the ICE username fragment, or ufrag, a short identifier exchanged during session setup and echoed in STUN connectivity checks. We generate the server-side ufrag so it contains just enough routing metadata for relay to infer the destination cluster and owning transceiver.

During signaling, the transceiver allocates session state and returns a shared relay VIP and UDP port in the SDP answer. A VIP is a virtual IP address fronting the relay fleet; combined with the port, it gives the client a single stable destination, such as `203.0.113.10:3478`, even though many relay instances sit behind it. The client’s first media-path packet is usually a STUN (Session Traversal Utilities for NAT) binding request, which ICE uses to verify that packets can reach the advertised address.

Relay parses just enough of that first STUN packet to read the server ufrag, decode the routing hint, and forward the packet to the owning transceiver. Each transceiver listens on a shared UDP socket, meaning one operating system endpoint bound to an internal IP:port, not one socket per session. After the relay creates a session from the client’s source IP:port to that transceiver destination, subsequent DTLS, RTP, and RTCP packets flow within the session without re-decoding the ufrag.

The relay’s session is purposefully minimal, consisting only of an in-memory session to inform packet forwarding, along with necessary counters for monitoring and timers for session expiration and cleanup. This design choice maintains packet routing directly on the packet path. If a relay restarts and loses the session, the next STUN packet rebuilds the session from the ufrag routing hint. To make it even more reliable, a Redis cache is employed to hold the mapping of <client IP + Port, transceiver IP + Port> once the route is established so that it can be recovered much earlier, before the next STUN packet arrives.

Once we reduced the public UDP surface to a small number of stable addresses and ports, we could deploy the same relay pattern globally. Global Relay is our fleet of geographically distributed relay ingress points that all implement the same packet-forwarding behavior.

Broad geographic ingress shortens the first client-to-OpenAI hop because a packet can enter our network at a relay close to the user, in both geography and network topology, instead of crossing the public internet to a distant region first. In practical terms, that means lower latency, less jitter, and fewer avoidable loss bursts before traffic reaches our backbone.6

We use Cloudflare geo and proximity steering for signaling so the initial HTTP or WebSocket request reaches a nearby transceiver cluster. The request context dictates the session’s location and which Global Relay ingress point is advertised to the client. The SDP answer provides the Global Relay address, while the ufrag contains sufficient information for Global Relay to route media to the designated cluster and relay to route to the destination transceiver.

Together, geo-steered signaling and Global Relay put both setup and media on a nearby entry path while keeping the session anchored to one transceiver. That reduces the round-trip time for signaling and for the first ICE connectivity check, which directly shortens how long a user waits before speech can start.

We wrote the relay service in Go and kept the implementation narrow on purpose. On Linux, the kernel’s networking stack receives UDP packets from the machine’s network interface and delivers them to a socket, the operating system endpoint that a process reads after binding an IP:Port. Relay runs in userspace, so a regular Go process reads packet headers from that socket, updates a small amount of flow state, and forwards packets without terminating WebRTC. We did not need any kernel-bypass framework, which would let a userspace process poll network queues directly for higher packet rates but also add operational complexity.

Key design choices:

No protocol termination: Relay parses only STUN headers/ufrag; it uses cached state for subsequent DTLS, RTP, and RTCP, keeping packets opaque.
Ephemeral state: It maintains a small, short-timeout, in-memory map of client address to transceiver destination for flow state and observability.
Horizontal scalability: Multiple relay instances run in parallel behind a load balancer. State is not hard WebRTC state, so restarts cause minimal traffic drops and quick flow recovery.

Efficiency measures:

SO_REUSEPORT is a Linux socket option that allows multiple relay workers on the same machine to bind the same UDP port. The kernel then distributes incoming packets across those workers, which avoids a single read-loop bottleneck.
runtime.LockOSThread pins each UDP-reading goroutine to a specific OS thread. Combined with SO_REUSEPORT, that tends to keep packets from the same flow (the source and destination IP:Port plus protocol) on the same CPU core, improving cache locality and reducing context switching.
Pre-allocated buffers and minimal copying keep parsing and allocation overhead low to avoid garbage collection in Go.

This implementation handled our global real-time media traffic with a relatively small relay footprint, so we kept the simpler design instead of taking on a kernel bypass route.

This architecture lets us run WebRTC media in Kubernetes without exposing thousands of UDP ports. That matters because a smaller and fixed UDP surface is easier to secure and load balance, and it lets the infrastructure scale without reserving large public port ranges. With better infra support from Kubernetes and more security due to smaller surface area, this design also preserves standard WebRTC behavior for clients and confirms that an SFU-less design was the right default for our workload. Most of our sessions are point-to-point, latency-sensitive, and easier to scale when inference services don’t need to behave like WebRTC peers.

The broader lesson is that the best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior. Encoding routing metadata into a protocol-native field gave us deterministic first-packet routing, a small public UDP footprint, and enough flexibility to place ingress close to users around the world.

A few choices were especially important:

Preserve protocol semantics at the edge. Clients still speak standard WebRTC, which keeps browser and mobile interoperability intact.
Keep hard session states in one place. Transceiver owns ICE, DTLS, SRTP, and session lifecycle; relay only forwards packets.
Route on information already present in setup. The ICE ufrag gave us a first-packet routing hook without adding a hot-path lookup dependency.
Optimize for the common case before reaching for kernel bypass. A narrow Go implementation with careful use of SO_REUSEPORT, thread pinning, and low-allocation parsing was enough for our workload.

Real-time voice AI only works when infrastructure makes latency feel invisible. For us, that meant changing the shape of our WebRTC deployment without changing what clients expect from WebRTC itself.