I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours

For converting HTML to Markdown in PHP, markydown is pretty good: https://devkram.de/markydown/

I think the most interesting thing about this is how it demonstrates that a very particular kind of project is now massively more feasible: library porting projects that can be executed against implementation-independent tests.

The big unlock here is https://github.com/html5lib/html5lib-tests - a collection of 9,000+ HTML5 parser tests stored in their own implementation-independent file format, e.g. this one: https://github.com/html5lib/html5lib-tests/blob/master/tree-...
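To give a feel for why that file format is so portable, here is a minimal sketch of a reader for the tree-construction `.dat` layout (`#data`, `#errors`, `#document` sections, one test per `#data` marker). The `TreeTest` class, `parse_dat` function and `SAMPLE` text are names and data I made up for illustration, and the sample is written in the style of the suite rather than copied from it. It also ignores edge cases, such as input lines that happen to look like section markers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Section markers used by the tree-construction files.
SECTION_MARKERS = {
    "errors", "new-errors", "document-fragment",
    "script-on", "script-off", "document",
}


@dataclass
class TreeTest:  # hypothetical container, not part of html5lib-tests
    data: str
    errors: List[str] = field(default_factory=list)
    document: str = ""
    fragment_context: Optional[str] = None


def parse_dat(text: str) -> List[TreeTest]:
    """Split a .dat file into tests; each test starts at a '#data' line."""
    tests = []
    # Split on the marker rather than on blank lines, because the
    # input markup itself may contain blank lines.
    for chunk in ("\n" + text).split("\n#data\n")[1:]:
        sections: Dict[str, List[str]] = {"data": []}
        current = "data"
        for line in chunk.split("\n"):
            if line.startswith("#") and line[1:] in SECTION_MARKERS:
                current = line[1:]
                sections.setdefault(current, [])
            else:
                sections[current].append(line)
        fragment = [l for l in sections.get("document-fragment", []) if l]
        tests.append(TreeTest(
            data="\n".join(sections["data"]),
            errors=[l for l in sections.get("errors", []) if l],
            document="\n".join(sections.get("document", [])).rstrip("\n"),
            fragment_context=fragment[0] if fragment else None,
        ))
    return tests


# Illustrative sample only; the error message text is invented.
SAMPLE = """#data
<p>One<p>Two
#errors
(1,3): expected-doctype-but-got-start-tag
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"

#data
<b>x
#errors
#document
| <html>
|   <head>
|   <body>
|     <b>
|       "x"
"""

parsed = parse_dat(SAMPLE)
for t in parsed:
    print(repr(t.data), len(t.errors), "error(s)")
```

A port in any language only needs a reader like this plus a function that dumps its own tree in the same `| ` indented form, and then the comparison is a string equality check.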

The Servo html5ever Rust codebase uses them. Emil's JustHTML Python library used them too. Now my JavaScript version gets to tap into the same collection.

This meant that I could set a coding agent loose to crunch away on porting that Python code to JavaScript and have it keep going until that enormous existing test suite passed.

Sadly conformance test suites like html5lib-tests aren't that common... but they do exist elsewhere. I think it would be interesting to collect as many of those as possible.

The oracle approach mentioned downthread is what makes this practical even without conformance test suites. Run the original, capture input/output pairs, use those as your tests. Property-based testing tools like Hypothesis can generate thousands of edge cases automatically.
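A rough sketch of that oracle workflow, using only the standard library: generate inputs, record what the original does with them, then replay the recorded pairs against the port. The function names are hypothetical. `html.escape` stands in for "the original", and `ported_escape` is a deliberately incomplete "port" so there is something to catch. A property-based tool such as Hypothesis could replace the hand-rolled `random_inputs` generator, but I've left it out to keep the example dependency-free.

```python
import html
import json
import random
import string
from typing import Callable, List, Tuple


def random_inputs(count: int, seed: int = 0) -> List[str]:
    """Generate reproducible pseudo-random strings biased toward markup characters."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + "<>/&\"' ="
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 40)))
        for _ in range(count)
    ]


def capture_golden(reference: Callable[[str], str], inputs: List[str]) -> str:
    """Run the original on every input and serialize the input/output pairs."""
    return json.dumps([{"input": s, "expected": reference(s)} for s in inputs])


def check_port(port: Callable[[str], str], golden_json: str) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case where the port disagrees."""
    failures = []
    for case in json.loads(golden_json):
        actual = port(case["input"])
        if actual != case["expected"]:
            failures.append((case["input"], case["expected"], actual))
    return failures


def ported_escape(s: str) -> str:
    # Hypothetical port with a deliberate gap: single quotes are not escaped.
    return (s.replace("&", "&amp;").replace("<", "&lt;")
             .replace(">", "&gt;").replace('"', "&quot;"))


golden = capture_golden(html.escape, random_inputs(200))
failures = check_port(ported_escape, golden)
print(f"{len(failures)} of 200 recorded cases disagree with the original")
```

Because the golden file is plain JSON, the replay half can be rewritten in whatever language the port targets. The limitation is that the recorded pairs also freeze any bugs in the original.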

For solo devs this changes the calculus entirely. Supporting multiple languages used to mean maintaining multiple codebases - now you can treat the original as canonical and regenerate ports as needed. The test suite becomes the actual artifact you maintain.

Few know that Firefox's HTML5 parser was originally written in Java, and only afterward semi-mechanically translated (pre-LLMs) to the dialect of C++ used in the Gecko codebase.

This blog post isn't really about HTML parsers, however. The JustHTML port described in this blog post was a worthwhile exercise as a demonstration on its own.

Even so, I suspect that for this particular application, it would have been more productive/valuable to port the Java codebase to TypeScript rather than using the already vibe coded JustHTML as a starting point. Most of the value of what is demonstrated by JustHTML's existence in either form comes from Stenström's initial work.

> Code is so cheap it’s practically free. Code that works continues to carry a cost, but that cost has plummeted now that coding agents can check their work as they go.

I personally think that even before LLMs, the cost of code wasn't necessarily the cost of typing out the characters in the right order, but having a human actually understand it to the extent that changes can be made. This continues to be true for the most part. You can vibe code your way into a lot of working code, but you'll inevitably hit a hairy bug or a real world context dependency that the LLM just cannot solve, and that is when you need a human to actually understand everything inside out and step in to fix the problem.

From original repository:

     Verified Compliance: Passes all 9k+ tests in the official html5lib-tests suite (used by browser vendors).

Yes, browsers do use it. But they handle a lot of stuff differently.

    selectolax  68%  No  Very Fast  CSS selectors C-based (Lexbor). Very fast but less compliant.

The original author compares selectolax to html5lib-tests, but the reality is that when you compare selectolax to Chrome output, you get 90%+.

One of the tests:

  INPUT: <svg><foreignObject></foreignObject><title></svg>foo

It fails for selectolax:

  Expected:
  | <html>
  |   <head>
  |   <body>
  |     <svg svg>
  |       <svg foreignObject>
  |       <svg title>
  |     "foo"
  Actual:
  | <html>
  |   <head>
  |   <body>
  |     <svg>
  |       <foreignObject>
  |       <title>
  |     "foo"

But you get this in Chrome and selectolax:

    <html><head></head><body><svg><foreignObject></foreignObject><title></title></svg>foo
    </body></html>
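To make the mismatch concrete: the two dumps above differ only in the `svg ` prefix that the html5lib tree format puts on foreign elements. A small sketch (the function name is mine, and this is not how the test suite itself compares trees) shows that stripping those prefixes makes them identical. Whether that points to a serialization difference or a real namespace difference in the parser can't be told from the dump alone.

```python
import re

# The two tree dumps quoted above, without the forum indentation.
EXPECTED = """\
| <html>
|   <head>
|   <body>
|     <svg svg>
|       <svg foreignObject>
|       <svg title>
|     "foo"
"""

ACTUAL = """\
| <html>
|   <head>
|   <body>
|     <svg>
|       <foreignObject>
|       <title>
|     "foo"
"""


def strip_namespace_prefixes(dump: str) -> str:
    """Drop the 'svg ' / 'math ' namespace prefix from element lines."""
    return re.sub(r"<(?:svg|math) ([^>]+)>", r"<\1>", dump)


print(EXPECTED == ACTUAL)                            # exact comparison
print(strip_namespace_prefixes(EXPECTED) == ACTUAL)  # comparison ignoring namespaces
```

So a "68%" score against html5lib-tests and a "90%+" score against serialized Chrome output can both be true at once, since the serialized HTML string carries no namespace information.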

Remarkable that it echoes, from a different angle, this post from just a few days ago on HN:

https://martinalderson.com/posts/has-the-cost-of-software-ju...

That post was largely dismissed in the comments here on HN. Simon's experiment gives the argument new grounds.

My opinion on the ending open questions:

> Does this library represent a legal violation of copyright of either the Rust library or the Python one? Even if this is legal, is it ethical to build a library in this way?

Currently, I am experimenting with two projects in Claude Code: a Rust/Python port of a Python repo which necessitates a full rewrite to get the desired performance/feature improvements, and a Rust/Python port of a JavaScript repo mostly because I refuse to install Node (the speed improvement is nice though).

In both of those cases, the source repos are permissively licensed (MIT), which I interpret as the developer's intent for how their code should be used. It is in the spirit of open source to produce better code by iterating on existing code, as that's how the software ecosystem grows. That would be the case whether a human wrote the porting code or not. If Claude 4.5 Opus can produce better/faster code which has the same functionality and passes all the tests, that's a win for the ecosystem.

As a courtesy and for transparency, I will still link and reference the original project in addition to disclosing the agent use, although those things likely aren't required and others may not do the same. That said, I'm definitely not using an agent to port any GPL-licensed code.

> How much better would this library be if an expert team hand crafted it over the course of several months?

i think the fun conclusion would be: ideally no better, and no worse. that is the state you arrive at IFF you have complete tests and specs (including probably for performance). now a human team handcrafting would undoubtedly make important choices not clarified in specs, thereby extending the spec. i would argue that human chain of thought from deep involvement in building and using the thing is basically 100% of the value of human handcrafting, because otherwise yeah go nuts giving it to an agent.

The biggest challenge an agent will face with tasks like these is that quality diminishes as the input grows. Specifically, I find that input above, say, 10k tokens dramatically reduces the quality of the generated output.

This specific case worked well, I suspect, because LLMs have a LOT of prior knowledge of HTML and have seen multiple implementations of HTML parsing in training.

Thus I suspect that real-world attempts at similar projects in any less well-known domain will fail miserably.

The problem with translating between languages is that code that "looks the same and runs" is not necessarily equally idiomatic or "acceptable". It seems to turn into long files of if-statements, flags, checks and so on. This might be considered idiomatic enough in Python, but it's not something you'd want to work with in functional or typed code.

While this example is explicitly asking for a port (thus a copy), I also find in general that LLMs' default behavior is to spit out new code from their vast pre-trained encyclopedia, rather than adding an import of some library that already serves that purpose.

I'm curious if this will implicitly drive a shift in the usage of packages / libraries broadly, and if others think this is a good or bad thing. Maybe it cuts down the surface of upstream supply-chain attacks?

Not all AI-assisted ports are quite so successful[0]

[0] https://ammil.industries/the-port-i-couldnt-ship/

This seems really impressive. I am too lazy to replicate this, but I do wonder how important the test suite is for a port that likely uses straightforward, dependency-free Python code https://github.com/EmilStenstrom/justhtml/tree/main/src/just...

It is enormously useful for the author to know that the code works, but my intuition is if you asked an agent to port files slowly, forming its own plan, making commits every feature, it would still get reasonably close, if not there.

Basically, I am guessing that this impressive output could have been achieved based on how good models are these days with large amounts of input tokens, without running the code against tests.

"If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed"

I'm a bit sad about this; I'd rather have "had fun" doing the coding, and get AI to create the test cases, than vice versa.

A couple of quick points from the read (cool, btw!). It's not trivial that Simon poked the LLM to get something up and running and working ASAP. Building on a working core has always been good engineering behavior in my opinion, but I have found it's extra helpful when it comes to LLM coding: it brings the compiler and tests "in the loop" for the LLM and helps keep it on the rails. Otherwise you may find you get thousands of lines of code that don't work, or are a goose chase, or are all gilding of lilies.

As is mentioned in the comments, I think the real story here is twofold: one, we're getting longer uninterrupted productive work out of frontier models (yay), and two, a formal test suite has become vastly more useful in the last few months. I'd love to see more of these made.

  >  It took two initial prompts and a few tiny follow-ups. GPT-5.2 running in Codex CLI ran uninterrupted for several hours, burned through 1,464,295 input tokens, 97,122,176 cached input tokens and 625,563 output tokens and ended up producing 9,000 lines of fully tested JavaScript across 43 commits.

Using a random LLM cost calculator, this amounts to $28.31... pretty reasonable for functional output.

I am now confident that within 5-10 years the number of junior and mid-level dev positions (most? all?), and many senior ones, is going to drop enormously.

Source: https://www.llm-prices.com/#it=1464295&cit=97123000&ot=62556...
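For anyone who wants to sanity-check the dollar figure, the arithmetic is just three token counts times three per-million rates. The token counts below are the ones quoted from the post; the rates are my assumption (USD per million tokens) and are not stated anywhere in this thread, so treat the result as illustrative and check the calculator link for the real numbers.

```python
# Token counts quoted from the post.
INPUT_TOKENS = 1_464_295
CACHED_INPUT_TOKENS = 97_122_176
OUTPUT_TOKENS = 625_563

# ASSUMPTION: USD per million tokens; not taken from the thread.
ASSUMED_RATES = {"input": 1.75, "cached_input": 0.175, "output": 14.00}


def estimate_cost(input_tokens: int, cached_tokens: int,
                  output_tokens: int, rates: dict) -> float:
    """Sum each token bucket multiplied by its per-million-token rate."""
    total = (
        input_tokens * rates["input"]
        + cached_tokens * rates["cached_input"]
        + output_tokens * rates["output"]
    )
    return total / 1_000_000


cost = estimate_cost(INPUT_TOKENS, CACHED_INPUT_TOKENS, OUTPUT_TOKENS, ASSUMED_RATES)
cached_share = CACHED_INPUT_TOKENS * ASSUMED_RATES["cached_input"] / 1_000_000 / cost
print(f"estimated cost: ${cost:.2f}")
print(f"share from cached input: {cached_share:.0%}")
```

Under those assumed rates the total lands within a cent or so of the quoted figure, and cached input tokens make up the majority of the bill, which suggests the cost of a long agent run depends heavily on the cache discount.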

While I understand the intent of this exercise, couldn't someone just wasm compile the Servo html5ever Rust codebase?

I think specs + tests are the new source of truth, code is disposable and rebuildable. A well tested project is reliable both for humans and AI, a badly tested one is bad for both. When we don't test well I call it "vibe testing, or LGTM testing"

> Can I even assert copyright over this, given how much of the work was produced by the LLM?

No, because it's a derivative work of the base library.

Wild to ask, "Is it legal, ethical, responsible or even harmful to build in this way and publish it?" AFTER building and publishing it. Author made up his mind already, or doesn't actually care. Ethics and responsibility should guide one's actions, not just be engagement fodder after the fact.

What was your prompt to get it to run the test suite and heal tests at every step? I didn’t see that mentioned in your write up. Also, any specific reason you went with Codex over Claude Code?

I suppose a next experiment could be to reproduce sqlite from its test suite.

I think it is time for all HW vendors to open up their documentation so we can use AI to write drivers for niche OSes.

There are many OSes out there suffering from the same problem: lack of drivers.

AI can change it.

^Claude still thinks it's 2024. This happens to me consistently.

> How much better would this library be if an expert team hand crafted it over the course of several months?

It's an interesting assumption that an expert team would build a better library. I'd change this question to: would an expert team build this library better?

Another interesting experiment is to start from the html5lib-tests suite directly, instead of JustHTML. Worth another experiment?

Now do the same with Rust, build a Python wrapper and we went full circle :)

I think the decision of SQLite to keep its large test suite private is very wise in the presence of thieves.

YOU didn't port shit, the ai did all the work.

In my experience it is closer to 25k, but that’s a minor point. What task do you need to do that requires more than that many tokens?

No, seriously. If you break your task into bite sized chunks, do you really need more than that at a time? I rarely do.