If the one-shot output resembles anything working (and I am betting it will), then obviously this isn't clean room at all.
Maybe a more sensible challenge would be to describe a system that hasn't previously been emulated before (or had an emulator source released publicly as far as you can tell from the internet) and then try it.
For fun, try using obscure CPUs, giving it the same level of specification as you needed for this. Or even try an imagined Z80-like CPU, but with the bit order of the encodings swapped and the ALU instructions reordered, and see how it manages.
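The bit-swapped variant would be easy to set up mechanically: feed every opcode byte through a bit reversal before decoding, so none of the memorized Z80 encodings line up. A minimal sketch of such a helper (the name `rev8` and its use are hypothetical, just to illustrate the idea):

```c
#include <stdint.h>

/* Hypothetical helper for the "bit-reversed Z80" idea: reverse the bit
 * order of a byte so that, e.g., opcode 0x01 becomes 0x80 on the wire. */
static uint8_t rev8(uint8_t b) {
    b = (uint8_t)((b & 0xF0) >> 4 | (b & 0x0F) << 4); /* swap nibbles */
    b = (uint8_t)((b & 0xCC) >> 2 | (b & 0x33) << 2); /* swap bit pairs */
    b = (uint8_t)((b & 0xAA) >> 1 | (b & 0x55) << 1); /* swap adjacent bits */
    return b;
}
```

A decoder for the imagined CPU would call `rev8()` on each fetched byte before dispatching, leaving everything else identical.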
> The above tools could theoretically be used to compile, build, and bootstrap an entire FreeBSD, Linux, or other similar operating system kernel onto MMIX hardware, were such hardware to exist.
What if Agents were hip enough to recognize that they have navigated into a specialized area and need additional hinting? "I'm set up for CP/M development, but what I really need now is Z80 memory management technique. Let me swap my tool head for the low-level Z80 unit..."
We can throw RAGs on the pile and hope the context window includes the relevant tokens, but what if there were pointers instead?
I wish people would stop using this phrase altogether for LLM-assisted coding. It has a specific legal and cultural meaning, and the giant amount of proprietary IP that has been (illegally?) fed to the model during training completely disqualifies any LLM output from claiming this status.
I struggled a lot with some complex software, which worked on some emulators and failed on others (and mine).
For example one bug I had, which is still outstanding, relates to the Hisoft C compiler:
https://github.com/skx/cpmulator/issues/250
But I see that my cpm-dist repository is referenced in the download script so that made me happy!
It's great to see people still using CP/M, writing software for it, and sharing the knowledge. Though I do think the choice to implement the CCP in C, rather than using a genuine one, is an interesting one, and a bit of a cheat. It means that you cannot use "SUBMIT" and other common-place binaries/utilities.
> Address bits for pixel (x, y):
> * 010 Y7 Y6 Y2 Y1 Y0 | Y5 Y4 Y3 X7 X6 X5 X4 X3
Which is wrong. It's X4-X0. The comment does not match the code below.
> static inline uint16_t zx_pixel_addr(int y, int col) {
It computes a pixel address with 0x4000 added to it, only to always subtract 0x4000 from it later. The ZX apparently has ROM at 0x0000..0x3fff, which necessitates the offset in general, but not in this case in particular.
This and the other inline function next to it for attributes are only ever used once.
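For reference, the well-documented 48K screen layout the comment is describing can be sketched like this (my own reconstruction of the standard address format, not the code under review):

```c
#include <stdint.h>

/* ZX Spectrum 48K bitmap byte address for scanline y (0..191) and byte
 * column col (0..31). Layout of the 16-bit address, high bit to low:
 *   0 1 0 Y7 Y6 Y2 Y1 Y0 | Y5 Y4 Y3 X4 X3 X2 X1 X0
 * The "010" prefix is the fixed 0x4000 base of screen memory. */
static inline uint16_t zx_pixel_addr(int y, int col) {
    return (uint16_t)(0x4000
         | ((y & 0xC0) << 5)   /* Y7 Y6    -> address bits 12..11 */
         | ((y & 0x07) << 8)   /* Y2 Y1 Y0 -> address bits 10..8  */
         | ((y & 0x38) << 2)   /* Y5 Y4 Y3 -> address bits 7..5   */
         | (col & 0x1F));      /* X4..X0   -> address bits 4..0   */
}
```

With this layout the first bitmap byte is 0x4000 and the last is 0x57FF, which is why the low five bits are X4-X0, as the comment above points out.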
> During the 192 display scanlines, the ULA fetches screen data for 128 T-states per line.
Yep.. but..
> Instead of a 69,888-byte lookup table
How does that follow? The description completely forgets to mention where the number comes from: (192 display scan lines + 64 + 56 border/blank lines) × 224 T-states per line = 69,888.
I'm bored. This is a pretty muddy implementation. It reminds me of the way children play with Duplo blocks.
As HN likes to say, only an amateur vibe-coder could believe this.
Better still invent a CPU instruction set, and get it to write an emulator for that instruction set in C.
Then invent a C-like HLL and get it to write a compiler from your HLL to your instruction set.
"This is, I think, in contradiction with the idea that LLMs memorize the whole training set and uncompress what they have seen. LLMs can memorize certain over-represented documents and code, but while they can extract such verbatim parts if prompted to do so, they don't have a copy of everything they saw during training, nor do they spontaneously emit copies of already-seen code in their normal operation."
Can't things basically get baked into the weights when trained on enough iterations, and isn't this the basis for a lot of plagiarism issues we saw with regards to code and literature? It seems like this is maybe downplaying the unattributed use of open source code when training these models.
I tried creating an emulator for a CPU that is very well known but lacks a working open source emulator.
Claude, Codex and Gemini were all very good at starting something that looked great, but all failed to reach a working product. They ended up in a loop where fixing one issue caused something else to break, and could never get out of it.
IMHO zx_pixel_addr() is not bad; it makes sense in this case. I'm a lot more unhappy with the actual screen -> RGB conversion that uses this function, which is not as fast as it could be. For instance, the video RAM to ST77xx display conversion in my own zx2040 emulator (written by hand, also on GitHub) is more optimized. But providing the absolute address in video memory, instead of the offset, is fine. Just a design choice.
> This and the other inline function next to it for attributes are only ever used once.
I agree with that, but honestly 90% of developers work this way, and LLMs have this style for that reason. A style I dislike as well...
About the lookup table: the code it uses in the end, in zx_contend_delay(), was a hint I provided. The old code was correct but extremely memory wasteful (there are emulators that really take this path of a huge lookup table, maybe to avoid the division for maximum speed), and there was a full comment about the T-states; after the code was changed, this half-comment is bad and totally useless indeed. In the Spectrum emulator I provided a few hints. In the Z80, no hint at all.
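To make the trade-off concrete: a per-T-state table costs 312 lines × 224 T-states = 69,888 entries, while computing contention on the fly is just a division and a modulo. A simplified sketch, assuming the standard 48K 6,5,4,3,2,1,0,0 contention pattern (the exact start offset and phase vary per machine, and the names here are hypothetical, not the repository's actual code):

```c
#include <stdint.h>

#define TS_PER_LINE   224   /* T-states per scanline */
#define TOP_LINES      64   /* vblank + top border lines before the display */
#define DISPLAY_LINES 192   /* lines during which the ULA fetches screen data */

/* Hypothetical on-the-fly contention delay: derive line and column from the
 * frame T-state counter instead of indexing a 69,888-entry table. The ULA
 * fetches screen data for the first 128 T-states of each display line, and
 * within those the delay pattern 6,5,4,3,2,1,0,0 repeats every 8 T-states. */
static int contend_delay(uint32_t t) {
    uint32_t line = t / TS_PER_LINE;
    uint32_t col  = t % TS_PER_LINE;
    if (line < TOP_LINES || line >= TOP_LINES + DISPLAY_LINES) return 0;
    if (col >= 128) return 0;   /* border/retrace part of the line */
    static const int pat[8] = {6, 5, 4, 3, 2, 1, 0, 0};
    return pat[col % 8];
}
```

The division is cheap on modern hosts; the table only wins when every last host cycle matters.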
If you check the code in general, the Z80 implementation for instance, it is solid work on average. Normally, after using automatic programming this way, I would ask the agent (and likely Codex as well) to check that the comments match the documentation. Here, since it is an experiment, I did zero refinements, to show the actual raw output you get. And it is not bad, I believe.
P.S. I see your comment greyed out, I didn't downvote you.
WTF? I appreciate your technical expertise but you can't be aggressive like this on HN, and we've had to ask you this before: https://news.ycombinator.com/item?id=45663563.
If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.
- Based on classic Z80 architecture by Zilog
- Inspired by modern RISC designs (ARM, RISC-V, MIPS)
In any case, an interesting experiment.
I tried asking Gemini and ChatGPT, "What opcode has the value 0x3c on the Intel 8048?"
They were both wrong. The datasheet with the correct encodings is easily found online. And there are several correct open source emulators, eg MAME.
I disagree that this is "aggressive." It's certainly opinionated. I think the AI does a bad job here and I'm attempting to express that in a humorous and qualified way.
> WTF?
You don't consider this to be "aggressive?"
> stick to the rules when posting here
Do you genuinely think I'm trying to be disruptive?
Funny enough, there is a 32-bit version of Z80 called Z380.
In fact this would make for an interesting benchmark: writing entire non-trivial apps based on the same prompt. Each model might be expected to write and use its own test cases, but then all could be judged against a common set of test cases provided as part of the benchmark suite.
Probably bonus points for telling it that you're emulating the well-known ZX Spectrum and then describing something entirely different, to see whether it treats that name as an arbitrary label or whether it significantly influences its code generation.
But you're right, of course: instruction decoding is such a relatively small portion of a CPU that the differences would be quite limited if all the other details remained the same. That's why a completely hypothetical system is better.
Essentially they can't do clean room anything!
You might as well hire the entire former mid-level programming team of a business and claim it's clean room work.
An LLM by itself is like a lossy image of all the text on the internet.
https://www.itprotoday.com/server-virtualization/windows-nt-...
antirez 3 days ago. 21234 views.
Anthropic recently released a blog post with the description of an experiment in which the last version of Opus, the 4.6, was instructed to write a C compiler in Rust, in a “clean room” setup.
The experiment methodology left me dubious about the kind of point they wanted to make. Why not provide the agent with the ISA documentation? Why Rust? Writing a C compiler is exactly a giant graph manipulation exercise: the kind of program that is harder to write in Rust. Also, in a clean room experiment, the agent should have access to all the information about well-established computer science progress related to optimizing compilers: there are a number of papers that could easily be synthesized into a number of markdown files. SSA, register allocation, instruction selection and scheduling. Those things needed to be researched *first*, as a prerequisite, and the implementation would still be “clean room”.
Not allowing the agent to access the Internet, nor any other compiler source code, was certainly the right call. Less understandable is the almost-zero steering principle, but this is coherent with a certain kind of experiment, if the goal was showcasing the completely autonomous writing of a large project. Yet, we all know this is not how coding agents are used in practice, most of the time. Anyone who uses coding agents extensively knows very well how, even without ever touching the code, a few hints here and there completely change the quality of the result.
I thought it was time to try a similar experiment myself: one that would take one or two hours at most, and that was compatible with my Claude Code Max plan. I decided to write a Z80 emulator, and then a ZX Spectrum emulator (and even more, a CP/M emulator, see later), in a condition that I believe makes more sense as a “clean room” setup. The result can be found here: https://github.com/antirez/ZOT.
For the Spectrum implementation, performed as a successive step, I provided much more information in the markdown file, like the kind of rendering I wanted in the RGB buffer, how it needed to be optional so that embedded devices could render the scanlines directly as they transferred them to the ST77xx display (or similar), how it should be possible to interact with the I/O port to set the EAR bit to simulate cassette loading in a very authentic way, and many other desiderata I had about the emulator.
This file also included the rules that the agent needed to follow, like:
* Accessing the internet is prohibited, but you can use the specification and test vectors files I added inside ./z80-specs.
* Code should be simple and clean, never over-complicate things.
* Each solid piece of progress should be committed in the git repository.
* Before committing, you should test that what you produced is high quality and that it works.
* Write a detailed test suite as you add more features. The tests must be re-executed at every major change.
* Code should be very well commented: things must be explained in terms that even people not well versed in certain Z80 or Spectrum internals should understand.
* Never stop for prompting, the user is away from the keyboard.
* At the end of this file, create a work in progress log, where you note what you already did and what is missing. Always update this log.
* Read this file again after each context compaction.
Then, I started a Claude Code session, and asked it to fetch all the useful documentation on the internet about the Z80 (later I did this for the Spectrum as well), and to extract only the useful factual information into markdown files. I also provided the binary files for the most ambitious test vectors for the Z80, the ZX Spectrum ROM, and a few other binaries that could be used to test if the emulator actually executed the code correctly. Once all this information was collected (it is part of the repository, so you can inspect what was produced) I completely removed the Claude Code session in order to make sure that no contamination with source code seen during the search was possible.
I started a new session, and asked it to check the specification markdown file, and to check all the documentation available, and start implementing the Z80 emulator. The rules were to never access the Internet for any reason (I supervised the agent while it was implementing the code, to make sure this didn’t happen), to never search the disk for similar source code, as this was a “clean room” implementation.
For the Z80 implementation, I did zero steering. For the Spectrum implementation I used extensive steering for implementing the TAP loading. More about my feedback to the agent later in this post.
As a final step, I copied the repository in /tmp, removed the “.git” repository files completely, started a new Claude Code (and Codex) session and claimed that the implementation was likely stolen or too strongly inspired by somebody else's work. The task was to check against all the major Z80 implementations for evidence of theft. The agents (both Codex and Claude Code), after extensive search, were not able to find any evidence of copyright issues. The only similar parts were well-established emulation patterns and things that are Z80 specific and can't be done differently; the implementation looked distinct from all the other implementations in a significant way.
Claude Code worked for 20 or 30 minutes in total, and produced a Z80 emulator that was able to pass ZEXDOC and ZEXALL, in 1200 lines of very readable and well commented C code (1800 lines with comments and blank spaces). The agent was prompted zero times during the implementation: it acted absolutely alone. It never accessed the internet, and the process it used was one of continuous testing, interacting with the CP/M binaries implementing ZEXDOC and ZEXALL, writing just the CP/M syscalls needed to produce the output on the screen. Multiple times it also used the Spectrum ROM and other binaries that were available, or binaries it created from scratch, to see if the emulator was working correctly. In short: the implementation was performed in a very similar way to how a human programmer would do it, rather than outputting a complete implementation “uncompressed” from the weights. Instead, different classes of instructions were implemented incrementally, and there were bugs that were fixed via integration tests, debugging sessions, dumps, printf calls, and so forth.
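The CP/M syscall surface needed to run ZEXDOC/ZEXALL really is tiny: the test binaries print through the BDOS entry point at address 0x0005, dispatching on register C, where function 2 outputs the character in E and function 9 prints a '$'-terminated string pointed to by DE. A hedged sketch of such a trap (this is my illustration, not the repository's code; a real emulator would write to the terminal, while here the output goes into a buffer for clarity):

```c
#include <stdint.h>

/* Hypothetical minimal BDOS trap: when the emulated CPU calls 0x0005,
 * dispatch on register C. Returns the number of characters written to out.
 *   C=2 (C_WRITE):    output the single character in E.
 *   C=9 (C_WRITESTR): output the '$'-terminated string at address DE. */
static int bdos_trap(uint8_t c, uint8_t e, uint16_t de,
                     const uint8_t *mem, char *out) {
    int n = 0;
    switch (c) {
    case 2:
        out[n++] = (char)e;
        break;
    case 9:
        for (uint16_t p = de; mem[p] != '$'; p++)
            out[n++] = (char)mem[p];
        break;
    }
    return n;
}
```

Everything ZEXALL needs beyond this is just memory and a faithful CPU core, which is why the tests are such a convenient self-contained harness.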
I repeated the process again. I instructed the documentation gathering session very accurately about the kind of details I wanted it to search on the internet, especially the ULA interactions with RAM access, the keyboard mapping, the I/O port, how the cassette tape worked and the kind of PWM encoding used, and how it was encoded into TAP or TZX files.
As I said, this time the design notes were extensive, since I wanted this emulator to be specifically designed for embedded systems: only 48K emulation, optional framebuffer rendering, very little additional memory used (no big lookup tables for ULA/Z80 access contention), ROM not copied into RAM to avoid using an additional 16K of memory but just referenced during initialization (so we have just one copy, in the executable), and so forth.
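The "ROM referenced, not copied" point can be sketched in a few lines: reads below 0x4000 go to a const ROM image linked into the executable, writes there are silently ignored, and only the 48K of RAM is ever allocated. This is my own hedged illustration of the idea, not the repository's actual code (the dummy ROM bytes here stand in for a real ROM image):

```c
#include <stdint.h>

/* Hypothetical sketch: the 16K ROM lives as a const array in the binary,
 * so the emulator allocates only the 48K of RAM. */
static const uint8_t zx_rom[16384] = { 0xF3, 0xAF };  /* dummy ROM bytes */

static uint8_t mem_read(const uint8_t *ram, uint16_t addr) {
    return addr < 0x4000 ? zx_rom[addr] : ram[addr - 0x4000];
}

static void mem_write(uint8_t *ram, uint16_t addr, uint8_t v) {
    if (addr >= 0x4000)
        ram[addr - 0x4000] = v;   /* writes into ROM space are ignored */
}
```

On a microcontroller the const array typically stays in flash, so the 16K never touches scarce SRAM at all.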
The agent was able to create very detailed documentation about the ZX Spectrum internals. I provided a few .z80 images of games, so that it could test the emulator in a real setup with real software. Again, I removed the session and started fresh. The agent started working and finished 10 minutes later, following a process that really fascinates me, and that you probably know very well: you see the agent working with a number of diverse skills. It is expert in everything programming related, so as it was implementing the emulator, it could immediately write detailed instrumentation code to “look” at what the Z80 was doing step by step, and how this changed the Spectrum emulation state. In this respect, I believe automatic programming to be already super-human: not in the sense that it is currently capable of producing code that humans can't produce, but in the concurrent usage of different programming languages, system programming techniques, DSP stuff, operating system tricks, math, and everything else needed to reach the result in the most immediate way.
When it was done, I asked it to write a simple SDL based integration example. The emulator was immediately able to run the Jetpac game without issues, with working sound, and very little CPU usage even on my slow Dell Linux machine (8% usage of a single core, including SDL rendering).
Once the basic stuff was working, I wanted to load TAP files directly, simulating cassette loading. This was the first time the agent missed a few things, specifically about the timing the Spectrum loading routines expected, and here we are in the territory where LLMs start to perform less efficiently: they can't easily run the SDL emulator and see the border changing as data is received, and so forth. I asked Claude Code to do a refactoring so that zx_tick() could be called directly instead of only from zx_frame(), and to make zx_frame() a trivial wrapper. This way it was much simpler to sync EAR with what the loading routines expected, without callbacks or the wrong abstractions it had implemented. After that change, a few minutes later the emulator could load a TAP file, emulating the cassette, without problems.
This is how it works now:
    do {
        zx_set_ear(zx, tzx_update(&tape, zx->cpu.clocks));
    } while (!zx_tick(zx, 0));
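The tzx_update() side can be imagined as a pulse generator driven by the Z80 clock: the tape code tracks when the next EAR edge falls, in T-states, and toggles the level each time the emulated clock passes it. A hypothetical sketch of that core (the standard half-pulse lengths are well documented: pilot 2168 T-states, bit 0 = 855, bit 1 = 1710; the types and names here are illustrative, not the repository's):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical tape pulse generator state. */
typedef struct {
    uint64_t next_edge;   /* clock value (T-states) of the next EAR toggle */
    bool     level;       /* current EAR level */
} tape_t;

/* Advance the tape to 'clocks', toggling EAR on every elapsed edge.
 * half_pulse is the half-pulse length of the symbol being emitted,
 * e.g. 855 T-states for a "0" bit, 1710 for a "1" bit. */
static bool tape_level(tape_t *t, uint64_t clocks, uint32_t half_pulse) {
    while (clocks >= t->next_edge) {
        t->level = !t->level;
        t->next_edge += half_pulse;
    }
    return t->level;
}
```

Because the loop above calls it once per tick, the loading routine in ROM sees edges with exactly the spacing it measures, which is what makes the cassette simulation "authentic".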
I continued prompting Claude Code in order to make the key bindings more useful, and a few more things.
One thing I found really interesting was the ability of the LLM to inspect the COM files for the ZEXALL / ZEXDOC Z80 tests, easily spot the CP/M syscalls that were used (a total of three), and implement them for the extended Z80 test (executed by make fulltest). So, at this point, why not implement a full CP/M environment? Same process again, same good result in a matter of minutes. This time I interacted with it a bit more for the VT100 / ADM3 terminal escape conversions, reported things not working in WordStar initially, and in a few minutes everything I tested was working well enough (but there are fixes to do, like simulating a 2MHz clock; right now it runs at full speed, making CP/M games impossible to use).
The obvious lesson is: always provide your agents with design hints and extensive documentation about what they are going to do. Such documentation can be gathered by the agent itself. Also, make sure the agent has a markdown file with the rules for performing the coding task, and a trace of what it is doing, that is updated and re-read quite often.
But those tricks, I believe, are quite clear to everybody who has worked extensively with automatic programming in recent months. To think in terms of “what a human would need” is often the best bet, plus a few LLM-specific things, like the forgetting issue after context compaction, the continuous need to verify it is on the right track, and so forth.
Returning to the Anthropic compiler attempt: one of the steps the agent failed at was the one most strongly related to the idea of memorization of the pretraining set: the assembler. With extensive documentation, I can't see any way Claude Code (and, even more, GPT-5.3-codex, which in my experience is more capable for complex stuff) could fail at producing a working assembler, since it is quite a mechanical process. This is, I think, in contradiction with the idea that LLMs memorize the whole training set and uncompress what they have seen. LLMs can memorize certain over-represented documents and code, but while they can extract such verbatim parts if prompted to do so, they don't have a copy of everything they saw during training, nor do they spontaneously emit copies of already-seen code in their normal operation. We mostly ask LLMs to create work that requires assembling different knowledge they possess, and the result normally uses known techniques and patterns, but it is new code, not a copy of some pre-existing code.
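To see why assembling is mechanical: at its core, an assembler for a fixed ISA is a table lookup from mnemonic (plus operand pattern) to encoding, repeated line by line. A toy sketch, where the three opcodes shown are real Z80 encodings but the table and function are purely illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Toy mnemonic -> opcode table. NOP, HALT and RET are real Z80 single-byte
 * encodings; a real assembler adds operand parsing, multi-byte encodings,
 * labels and fixups, but the core remains a mechanical table lookup. */
typedef struct { const char *mnemonic; uint8_t opcode; } op_t;

static const op_t ops[] = {
    { "NOP",  0x00 },
    { "HALT", 0x76 },
    { "RET",  0xC9 },
};

/* Emit the encoding of one operand-less instruction into *out.
 * Returns 1 on success, 0 for an unknown mnemonic. */
static int assemble_one(const char *mnemonic, uint8_t *out) {
    for (size_t i = 0; i < sizeof(ops) / sizeof(ops[0]); i++) {
        if (strcmp(ops[i].mnemonic, mnemonic) == 0) {
            *out = ops[i].opcode;
            return 1;
        }
    }
    return 0;
}
```

Extending such a table to the full instruction set is tedious but entirely specification-driven, which is exactly why good documentation should have been enough for the agent.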
It is worth noting, too, that humans often follow a less rigorous process than the clean room rules detailed in this blog post: humans often download the code of different implementations related to what they are trying to accomplish, read it carefully, then try to avoid copying verbatim, but often take strong inspiration. This is a process I find perfectly acceptable, but it is important to keep in mind what happens in the reality of code written by humans. After all, information technology evolved so fast partly thanks to this massive cross-pollination effect.
For all the above reasons, when I implement code using automatic programming, I don’t have problems releasing it MIT licensed, like I did with this Z80 project. In turn, this code base will constitute quality input for the next LLMs training, including open weights ones.
To make my experiment more compelling, one could try to implement a Z80 and ZX Spectrum emulator without providing any documentation to the agent, and then compare the results. I didn't find the time to do it, but it could be quite informative.