This new company chose a very confusing name that has been used by the Open UI W3C Community Group for over 5 years.
Open UI is the standards group responsible for HTML having popovers, customizable select, invoker commands, and accordions. They're doing great work.
So this holds even for L = M. The speedup is not in the language, but in the rewriting and rethinking.
Claude tells me this is https://www.fumadocs.dev/
So you're reinventing JSON but binary? V8 JSON nowadays is highly optimized [1] and can process gigabytes per second [2], I doubt it is a bottleneck here.
[1] https://v8.dev/blog/json-stringify [2] https://github.com/simdjson/simdjson
It looks like neither is the "real win". Both the language and the algorithm made a big difference, as you can see in the first column of the last table - dropping WASM was a big speedup, and improving the algorithm on top of that was another big speedup.
Thanks for cutting through the clickbait. The post is interesting, but I'm so tired of being unnecessarily clickbaited into reading articles.
They say they measured that cost, and it was most of the runtime in the old version (though they don't give exact numbers). That cost does not exist at all in the new version, simply because of the language.
I don't think that's actually out yet, and more importantly, it doesn't change anything at runtime -- your code still runs in a JS engine (V8, JSC etc).
One thing I noticed was that they time each call and then use a median. Sigh. In a browser. :/ With timing attack defenses built into the JS engine.
Edit: fixed phone typos
You can use it today.
> converts internal AST into the public OutputNode format consumed by the React renderer
Why not just have the LLM emit the JSON for OutputNode? Why is a custom "language" and parser needed at all? And yes, there is a cost for marshaling data, so you should avoid doing it where possible, and do it in large chunks when it's not possible to avoid. This is not an unknown phenomenon.
Looks inside
“The old implementation had some really inappropriate choices.”
Every time.
Anyway, JavaScript is no stranger to breaking changes. Compare Chromium 47 to today. Just add actual integers as another breaking change, and WASM becomes almost unnecessary.
In their worst case it was just 5x. We clearly have some progress here.
Rust.
WASM.
TypeScript.
I am slowly beginning to understand why WASM did not really succeed.
For a parser specifically, you're probably spending a lot of time creating and discarding small AST nodes. That's exactly the kind of workload where V8's generational GC shines and where WASM's manual memory management becomes a liability rather than an asset.
The interesting question is whether this scales. A parser that runs on small inputs in a browser is a very different beast from one processing multi-megabyte files in a tight loop. At some point the WASM version probably wins - the question is whether that workload actually exists in your product.
The port had been done in a weekend just to see if we could use Python in production. The C++ code had taken a few months to write. The port was pretty direct, function for function. It was even line for line where language and library differences didn't offer an easier way.
A couple of us worked together for a day to find the reason for the speedup. Just looking at the code didn't give us any clues, so we started profiling both versions. We found out that the port had accidentally fixed a previously unknown bug in some code that built and compared cache keys. After identifying the small misbehaving function, we had to study the C++ code pretty hard to even understand what the problem was. I don't remember the exact nature of the bug, but I do remember thinking that particular type of bug would be hard to express in Python, and that's exactly why it was accidentally fixed.
We immediately started moving the rest of our back end to Python. Most things were slower, but not by much because most of our back end was i/o bound. We soon found out that we could make algorithmic improvements so much more quickly, so a lot of the slowest things got a lot faster than they had ever been. And, most importantly, we (the software developers) got quite a bit faster.
Additionally, even after those options are exhausted, only key parts might need a rewrite, not the whole thing.
However, I wonder how many care about actually learning about algorithms, data structures and mechanical sympathy in the age of Electron apps.
Quite often it feels like a rewrite is chosen because actually applying those skills requires exactly the CS knowledge many think isn't worthwhile learning.
That final summary benchmark means nothing. It lists 'baseline' as the 'Full-stream total' for the Rust implementation, and then says `serde-wasm-bindgen` is '+9-29% slower', but it never gives us the baseline value, because clearly the only benchmark it ran against the Rust codebase was the per-call one.
Then it mentions: "End result: 2.2-4.6x faster per call and 2.6-3.3x lower total streaming cost."
But the "2.6-3.3x" is by their own definition a comparison against the naive TS implementation.
I really think the guy just prompted Claude to "get this shit fast and then publish a blog post".
Not sold about the fundamental idea of OpenUI though. XML is a great fit for DSLs and UI snippets.
I didn't mind reading articles that are not about how Rust is great in theory (and maybe practice).
You still do get some latency from the event loop, because postMessage gets queued as a MacroTask, which is probably on the order of 10μs. But this is the price you have to pay if you want to run some code in a non-blocking way.
Pure speculation, but I would guess this has something to do with a copy constructor getting invoked in a place you wouldn't guess, that ends up in a critical path.
This was particularly true for one of the projects I've worked with in the past, where Python was chosen as the main language for a monitoring service.
In short, it proved itself to be a disaster: just the Python process collecting and parsing the metrics of all programs consumed 30-40% of the processing power of the lower end boxes.
In the end, the project went ahead for a while more, and we had to do all sorts of mitigations to get the performance impact to be less of an issue.
We did consider replacing it all with a few open source tools written in C plus some glue code; the initial prototype used a few MBs instead of dozens (or even hundreds) of MBs of memory, while barely registering any CPU load. But in the end it was deemed a waste of time when the whole project was terminated.
The most obvious approach would be to let LLMs generate code and render it, but that introduces problems like safety, UI consistency, and speed. OpenUI solves those problems and provides a safe, consistent, and token-optimized runtime for the LLMs to render live UI.
It's true that writing code in C doesn't automatically make it faster.
For example, string manipulation. 0-terminated strings (the default in C) are, frankly, an abomination. String processing code is a tangle of strlen, strcpy, strncpy, strcat, all of which require repeated passes over the string looking for the 0. (Even worse, reloading the string into the cache just to find its length makes things even slower.)
Worse is the problem that, in order to slice a string, you have to malloc some memory and copy the string. And then carefully manage the lifetime of that slice.
The fix is simple - use length-delimited strings. D relies on them to great effect. You can do them in C, but you get no succor from the language. I've proposed a simple enhancement for C to make them work https://www.digitalmars.com/articles/C-biggest-mistake.html but nobody in the C world has any interest in it (which baffles me, it is so simple!).
Another source of slowdown in C, I've discovered over the years, is that C is not a plastic language; it is a brittle one. The first algorithm you select for a C project gets so welded into it that it cannot be changed without great difficulty. (And we all know that algorithms are the key to speed, not coding details.) Why isn't C plastic?
It's because one cannot switch back and forth between a reference type and a value type without extensively rewriting every use of it. For example:
struct S { int a; }
int foo(struct S s) { return s.a; }
int bar(struct S *s) { return s->a; }
If you want to switch between reference and value, you've got to go through all your code swapping . and ->. It's just too tedious and never happens. In D:

struct S { int a; }
int foo(S s) { return s.a; }
int bar(S *s) { return s.a; }
I discovered while working on D that there is no reason for the C and C++ -> operator to even exist; the . operator covers both bases!

I understand your frustration with AI writing though. We are a small team, and given our roadmap it was either use LLMs to help collate all the internal benchmark results into a blog post or never write it, so we chose the former. This was a genuinely surprising and counterintuitive result for us, which is why we wanted to share it. Happy to clarify any of the numbers if helpful.
You could also try pretty fast fft: https://github.com/JorenSix/pffft.wasm
The primary motivation was speed and schema cohesion. We were running a JSON-based format, Thesys C1, in production for a year before we realized we couldn't add features fast enough because we were fighting the LLMs at multiple levels. It's probably too much to write in a comment, but we'd like to write about the motivation and all the things we tried in a separate blog post soon.
That said, Rust does have real problems. Manual memory management sucks. People think GC is expensive? Well, keep in mind malloc() and free() take global locks! People just have totally bogus mental models of what drives performance. These models lead them to technical nonsense.
The other day, someone linked back to this 2018 post on finding a cache coherency bug in the Xbox 360 CPU:
https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-d...
So much more genuinely engaging than any of the AI-“enhanced” sloppy, confused, trite writing that gets to the front page here daily because it’s been hyper-optimized for upvotes.
Crazy how many stories like this I’ve heard of how doing performance work helped people uncover bugs and/or hidden assumptions about their systems.
Never mind the age of Electron apps, even fewer care about those in the age of agents.
Would be kind of cool if e.g. Python or Ruby could be as fast as C or C++.
I wonder if this could be possible, assuming we could modify both to achieve that outcome, but without ending up with a language that is just like C or C++. Right now there is a strange divide between "scripting" languages and compiled ones.
> uv is fast because of what it doesn’t do, not because of what language it’s written in. The standards work of PEP 518, 517, 621, and 658 made fast package management possible. Dropping eggs, pip.conf, and permissive parsing made it achievable. Rust makes it a bit faster still.
We built the openui-lang parser in Rust and compiled it to WASM. The logic was sound: Rust is fast, WASM gives you near-native speed in the browser, and our parser is a reasonably complex multi-stage pipeline. Why wouldn't you want that in Rust?
Turns out we were optimising the wrong thing.
The openui-lang parser converts a custom DSL emitted by an LLM into a React component tree. It runs on every streaming chunk — so latency matters a lot. The pipeline has six stages:
autocloser → lexer → splitter → parser → resolver → mapper → ParseResult
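As a shape-only sketch, the pipeline can be written as typed function composition. Every stage body below is a toy stand-in (the post does not show the real internals), the `Ast` and `ParseResult` types are placeholders, and the `id = expression` statement form is assumed from the DSL description:

```typescript
type Ast = { id: string; expr: string };
type ParseResult = { nodes: Ast[] };

const autocloser = (src: string): string => src;   // toy: real stage balances unclosed brackets
const lexer = (src: string): string => src;        // toy: real stage tokenises
const splitter = (src: string): string[] =>        // one statement per depth-0 line
  src.split("\n").map((l) => l.trim()).filter(Boolean);
const parseStmt = (stmt: string): Ast => {         // statements have the form "id = expression"
  const eq = stmt.indexOf("=");
  return { id: stmt.slice(0, eq).trim(), expr: stmt.slice(eq + 1).trim() };
};
const resolver = (nodes: Ast[]): Ast[] => nodes;   // toy: real stage links id references
const mapper = (nodes: Ast[]): ParseResult => ({ nodes }); // toy: AST → public output shape

function parse(input: string): ParseResult {
  return mapper(resolver(splitter(lexer(autocloser(input))).map(parseStmt)));
}
```

The point of the shape is that `parse` is a pure string-in, object-out function, which is exactly what makes the WASM boundary cost below unavoidable per call.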
The DSL is a sequence of `id = expression` statements, and the final mapper stage converts the internal AST into the public OutputNode format consumed by the React renderer.

Every call to the WASM parser pays a mandatory overhead regardless of how fast the Rust code itself runs:
JS world WASM world
────────────────────────────────────────────────────────
wasmParse(input)
│
├─ copy string: JS heap → WASM linear memory (allocation + memcpy)
│
│ Rust parses ✓ fast
│ serde_json::to_string() ← serialize result
│
├─ copy JSON string: WASM → JS heap (allocation + memcpy)
│
JSON.parse(jsonString) ← deserialize result
│
return ParseResult
The Rust parsing itself was never the slow part. The overhead was entirely in the boundary: copy string in, serialize result to JSON string, copy JSON string out, then V8 deserializes it back into a JS object.
The natural question was: what if WASM returned a JS object directly, skipping the JSON serialization step? We integrated serde-wasm-bindgen which does exactly this — it converts the Rust struct into a JsValue and returns it directly.
It was 30% slower.
Here's why. JS cannot read a Rust struct's bytes from WASM linear memory as a native JS object — the two runtimes use completely different memory layouts. To construct a JS object from Rust data, serde-wasm-bindgen must recursively materialise Rust data into real JS arrays and objects, which involves many fine-grained conversions across the runtime boundary per parse() invocation.
Compare that to the JSON approach: serde_json::to_string() runs in pure Rust with zero boundary crossings, produces one string, one memcpy copies it to the JS heap, then V8's native C++ JSON.parse processes it in a single optimised pass. Fewer, larger, and more optimised operations win over many small ones.
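The difference can be illustrated without any real WASM. In this toy simulation, a counter stands in for the cost of each JS↔WASM hop: the JSON path crosses once with one big string, while the fine-grained path (mimicking what serde-wasm-bindgen effectively does) crosses once per value. Everything here is hypothetical scaffolding, not the actual bindings:

```typescript
type OutputNodeLike = { type: string; children: OutputNodeLike[] };

let crossings = 0;
function hostCall<T>(v: T): T { crossings++; return v; } // each call = one simulated JS↔WASM hop

// Path A: serialize once on the "Rust side", cross the boundary once,
// then let the engine's native JSON.parse rebuild the object in one pass.
function viaJson(result: OutputNodeLike): OutputNodeLike {
  const json = JSON.stringify(result); // pure "Rust-side" work, no crossing
  return JSON.parse(hostCall(json));   // one crossing, one optimised native parse
}

// Path B: materialise the object value-by-value across the boundary,
// one hop per field, roughly what serde-wasm-bindgen ends up doing.
function viaFineGrained(result: OutputNodeLike): OutputNodeLike {
  return {
    type: hostCall(result.type),
    children: hostCall(result.children.map(viaFineGrained)),
  };
}
```

For a tree of N nodes, path A performs one crossing while path B performs two per node, which is the "fewer, larger operations" effect in miniature.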
| Fixture | JSON round-trip (µs) | serde-wasm-bindgen (µs) | Slowdown |
|---|---|---|---|
| simple-table | 20.5 | 22.5 | +9% |
| contact-form | 61.4 | 79.4 | +29% |
| dashboard | 57.9 | 74.0 | +28% |
We reverted this change immediately.
We ported the full parser pipeline to TypeScript. Same six-stage architecture, same ParseResult output shape — no WASM, no boundary, runs entirely in the V8 heap.
What is measured: A single parse(completeString) call on the finished output string. This isolates per-call parser cost.
How it was run: 30 warm-up iterations to stabilise JIT, then 1000 timed iterations using performance.now() (µs precision). The median is reported. Fixtures are real LLM-generated component trees serialised in each format's real streaming syntax.
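The harness described above looks roughly like this as a sketch. The warm-up and iteration counts follow the methodology; `fn` is whatever function is being measured (not shown in the post):

```typescript
// Returns the median latency of `fn` in microseconds.
function benchMedianUs(fn: () => void, warmup = 30, iters = 1000): number {
  for (let i = 0; i < warmup; i++) fn(); // warm-up iterations: let the JIT stabilise
  const samples: number[] = [];
  for (let i = 0; i < iters; i++) {
    const t0 = performance.now();
    fn();
    samples.push((performance.now() - t0) * 1000); // ms → µs
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)]; // report the median, not the mean
}
```

The median is the usual choice here because one-off GC pauses and scheduler noise skew a mean badly at microsecond scale.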
Fixtures:
- simple-table — root + one Table with 3 columns and 5 rows (~180 chars)
- contact-form — root + form layout with 6 input fields + submit button (~400 chars)
- dashboard — root + sidebar nav + 3 metric cards + chart + data table (~950 chars)

| Fixture | TypeScript (µs) | WASM (µs) | Speedup |
|---|---|---|---|
| simple-table | 9.3 | 20.5 | 2.2x |
| contact-form | 13.4 | 61.4 | 4.6x |
| dashboard | 19.4 | 57.9 | 3.0x |
Eliminating WASM fixed the per-call cost, but the streaming architecture still had a deeper inefficiency.
The parser is called on every LLM chunk. The naïve approach accumulates chunks and re-parses the entire string from scratch each time:
Chunk 1: parse("root = Root([t") → 14 chars
Chunk 2: parse("root = Root([tbl])\ntbl = T") → 27 chars
Chunk 3: parse(full_accumulated_string) → ...
For a 1000-char output delivered in 20-char chunks: 50 parse calls processing a cumulative total of ~25,000 characters. O(N²) in the number of chunks.
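The arithmetic can be checked directly. This small helper (illustrative only) sums the accumulated-buffer lengths the naive strategy re-parses; 1000 chars in 20-char chunks gives 25,500 ≈ 25,000 characters across 50 calls:

```typescript
// Total characters the naive accumulate-and-reparse strategy processes.
function cumulativeCharsParsed(totalLen: number, chunkSize: number): number {
  let buffered = 0;  // length of the accumulated string so far
  let processed = 0; // running total of characters handed to parse()
  while (buffered < totalLen) {
    buffered = Math.min(buffered + chunkSize, totalLen); // append the next chunk
    processed += buffered;                               // re-parse everything so far
  }
  return processed;
}
```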
Statements terminated by a depth-0 newline are immutable — the LLM will never come back and modify them. We added a streaming parser that caches completed statement ASTs:
State: { buf, completedEnd, completedSyms, firstId }
On each push(chunk):
1. Scan buf from completedEnd for depth-0 newlines
2. For each complete statement found: parse + cache AST → advance completedEnd
3. Pending (last, incomplete) statement: autoclose + parse fresh
4. Merge cached + pending → resolve + map → return ParseResult
Completed statements are never re-parsed. Only the trailing in-progress statement is re-parsed per chunk. O(total_length) instead of O(N²).
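The `push()` loop above can be sketched in a few lines. This is a toy reduction of the described algorithm, assuming bracket characters alone determine nesting depth and using the trimmed statement text as a stand-in for its AST; the real resolve and map steps are omitted, and `charsParsed` is added here purely as instrumentation:

```typescript
class IncrementalParser {
  private buf = "";
  private completedEnd = 0;     // buf offset below which statements are parsed and cached
  private scanPos = 0;          // buf offset up to which bracket depth has been computed
  private depth = 0;            // bracket depth at scanPos
  private cache: string[] = []; // cached "ASTs" (toy: trimmed statement text)
  charsParsed = 0;              // instrumentation: total characters fed to parseStmt

  private parseStmt(src: string): string {
    this.charsParsed += src.length;
    return src.trim();
  }

  push(chunk: string): string[] {
    this.buf += chunk;
    // 1. Scan only the new characters for depth-0 newlines.
    for (let i = this.scanPos; i < this.buf.length; i++) {
      const c = this.buf[i];
      if (c === "(" || c === "[" || c === "{") this.depth++;
      else if (c === ")" || c === "]" || c === "}") this.depth--;
      else if (c === "\n" && this.depth === 0) {
        // 2. A complete statement: parse once, cache its "AST", advance completedEnd.
        const stmt = this.buf.slice(this.completedEnd, i);
        if (stmt.trim()) this.cache.push(this.parseStmt(stmt));
        this.completedEnd = i + 1;
      }
    }
    this.scanPos = this.buf.length;
    // 3. Only the pending (incomplete) statement is re-parsed on every push.
    const pending = this.buf.slice(this.completedEnd);
    const pendingAst = pending.trim() ? [this.parseStmt(pending)] : [];
    // 4. Merge cached + pending (a real implementation would resolve + map here).
    return [...this.cache, ...pendingAst];
  }
}
```

Because the scan and the re-parse both touch only characters past `completedEnd`, the total work is linear in the document length rather than quadratic in the number of chunks.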
What is measured: The total parse overhead accumulated across every chunk call for one complete document. This is different from the one-shot benchmark — it measures the sum of all parse calls during a real stream, not a single call. This is the number that affects actual user-perceived responsiveness.
How it was run: Documents are replayed in 20-char chunks. Each chunk triggers a parse() (naïve) or push() (incremental) call. Total time across all calls is recorded. 100 full-stream replays, median taken.
| Fixture | Naïve TS, re-parse every chunk (µs) | Incremental TS, cache completed (µs) | Speedup |
|---|---|---|---|
| simple-table | 69 | 77 | none (single statement, no cache benefit) |
| contact-form | 316 | 122 | 2.6x |
| dashboard | 840 | 255 | 3.3x |
The simple-table fixture is a single statement — there's nothing to cache, so both approaches are equivalent. The benefit scales with the number of statements because more of the document gets cached and skipped on each chunk.
The one-shot table shows 13.4µs for contact-form; the streaming table shows 316µs (naïve). These are not contradictory — they measure different things:
- One-shot: a single parse() call on the complete 400-char string
- Streaming: the sum of all 20 parse() calls during the stream (chunk 1 parses 20 chars, chunk 2 parses 40 chars, ..., chunk 20 parses 400 chars — the cumulative total of all those growing calls)

| Approach | Per-call cost | Full-stream total | Notes |
|---|---|---|---|
| WASM + JSON round-trip | 20-61µs | baseline | Copy overhead each call |
| WASM + serde-wasm-bindgen | 22-79µs | +9-29% slower | Hundreds of internal boundary crossings |
| TypeScript (naïve re-parse) | 9-19µs | 69-840µs | No boundary, but O(N²) streaming |
| TypeScript (incremental) | 9-19µs | 77-255µs | No boundary + O(N) streaming |
End result: 2.2-4.6x faster per call and 2.6-3.3x lower total streaming cost.
This experience sharpened our thinking on the right use cases for WASM:
✅ Compute-bound with minimal interop: image/video processing, cryptography, physics simulations, audio codecs. Large input → scalar output or in-place mutation. The boundary is crossed rarely.
✅ Portable native libraries: shipping C/C++ libraries (SQLite, OpenCV, libpng) to the browser without a full JS rewrite.
❌ Parsing structured text into JS objects: you pay the serialization cost either way. The parsing computation is fast enough that V8's JIT eliminates any Rust advantage. The boundary overhead dominates.
❌ Frequently-called functions on small inputs: if the function is called 50 times per stream and the computation takes 5µs, you cannot amortise the boundary cost.
Profile where time is actually spent before choosing the implementation language. For us, the cost was never in the computation - it was always in data transfer across the WASM-JS boundary.
"Direct object passing" through serde-wasm-bindgen is not cheaper. Constructing a JS object field-by-field from Rust involves more boundary crossings than a single JSON string transfer, not fewer. The boundary crossings happen inside the single FFI call, invisibly.
Algorithmic complexity improvements dominate language-level optimisations. Going from O(N²) to O(N) in the streaming case had a larger practical impact than switching from WASM to TypeScript.
WASM and JS do not share a heap. WASM has a flat linear memory (WebAssembly.Memory) that JS can read as raw bytes, but those bytes are Rust's internal layout - pointers, enum discriminants, alignment padding - completely opaque to the JS runtime. Conversion is always required and always costs something.
You are not the same.
Turns out the metrics just rounded to the nearest 5MB
Just write the parsing loop in something faster like C or Rust, instead of the whole thing.
And those will still care about CS.
So the claim is not well supported at all by the article as you stated, in fact the claim is literally disproven by the article.
The main lesson of the story. Just pick Python and move fast, kids. It doesn’t matter how fast your software is if nobody uses it.
I suspect it’s more likely to be something like passing std::string by value, not realising that would copy the string every time, especially given the statement that the mistake would be hard to express in Python.
They found that they had fewer bugs in Python so they continued with it.
So it's more so a story about architectural mistakes.
I'd rather not use python. The ick gets me every time.
That has not been my experience. JS/TS requires the most hand-holding, by far. LLMs are no doubt assumed to be good at JS due to the sheer amount of training data, but a lot of those inputs are of really poor quality, and even among the high quality inputs there isn't a whole lot of consistency in how they are written. That seems to trip up the LLMs. If anything, LLMs might finally be what breaks the JS camel's back. Although browser dominance still makes that unlikely.
> Very few people will then take the pain of optimizing it
Today's LLMs rarely take the initiative to write benchmarks, but if you ask, they will, and then they will iterate on optimizing using the benchmark results as feedback. It works fairly well. There is a conceivable near future where LLMs or LLM tools will start doing this automatically.
Not because they are brilliant, but because they are pretty good at throwing pretty much all known techniques at a problem. And they also don't tire of profiling and running experiments.
This is an alternative to json-render by Vercel or A2UI by Google which I'm guessing the flutter implementation is based on
The reason nobody uses your software could be that it is too slow. As an example, if you write a video encoder or decoder, using pure Python might work for postage-stamp sized video because today’s hardware is insanely fast, but even so, it likely will be easier to get the same speed in a language that’s better suited to the task.
Meanwhile my experience has been that whenever there has been a performance issue severe enough to actually matter, it's often been the result of some kind of performance bug, not so much language, runtime, or even algorithm choices for that matter.
Hence whenever the topic of how to improve performance comes up, I always, always insist that we profile first.
But yes I see what you mean and I think people are trying to solve it with skills and harnesses at the application layer but its not there yet
If you're writing FastAPI (and you should be if you're doing a greenfield REST API project in Python in 2026), just s/copy/steal/ what those guys are doing and you'll be fine.
Recently I tried Codex/GPT-5 with updating a Bluetooth library for batteries, and it was able to start capturing Bluetooth packets and comparing them with the library's other models. It was indefatigable. I didn't even know it was so easy to capture BLE packets.
If you have a comprehensive test suite or a realistic benchmark, saying "make tests pass" or "make benchmark go up" works wonders.
LLMs are really good at knowing patterns, we still need programmers to know which pattern to apply when. We'll soon reach a point where you'll be able to say "X is slow, do autoresearch on X" and X will just magically get faster.
The reason we can't yet isn't because LLMs are stupid, it's because autoresearch is a relatively new (last month or so) concept and hasn't yet entered into LLM pretraining corpora. LLMs can already do this, you just need to be a little bit more explicit in explaining exactly what you need them to do.
One of the reasons the project was killed was that we couldn't port it to our line of low powered devices without a full rewrite in C.
Please note this was more than a decade ago, way before Rust was the language it is today. I wouldn't choose anything else besides Rust today, since it gives the best of both worlds: a truly high-level language with low-level resource controls.
The mentality was "the language is fast, so as long as it compiles we're good"... Yeah that worked out about as well as you'd expect.
But, of course, profiling is always step one.
Flaky internet connection: most of the current 'soy devs' would be useless. Even more so with boosted-up chatbots.